[jira] [Updated] (HUDI-1265) Improving bootstrap and efficient migration of existing non-Hudi dataset

Ethan Guo (Jira) Fri, 19 Aug 2022 07:36:04 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ethan Guo updated HUDI-1265:
----------------------------
    Description: 
This is an EPIC to revisit the bootstrap logic for efficient migration of 
existing non-Hudi datasets, bridging any gaps with new features such as the 
metadata table.

Here are the two modes of bootstrap and migration we intend to support:
 # Onboarding new partitions only: Given an existing non-Hudi partitioned 
dataset, Hudi manages new partitions under the same table path while keeping 
non-Hudi partitions untouched in place.  The query engine treats non-Hudi 
partitions differently when reading the data.  This works well for immutable 
data, where old partitions receive no updates and new data is only appended 
to new partitions.
 # Metadata-only and full-record bootstrap: Given an existing Parquet dataset, 
Hudi generates the record-level metadata (Hudi meta columns) during the 
bootstrap process in a new table path, separate from the Parquet dataset.  
There are two sub-modes, which can be chosen per partition within a single 
bootstrap action.  This lets Hudi perform upserts on all partitions.
 ## Metadata-only: generates only the record-level metadata for each Parquet 
file, without rewriting the actual data records.  During query execution, the 
source data is merged with the Hudi metadata to return the results.  This is 
the default mode.
 ## Full-record: uses bulk insert to copy and rewrite the source data together 
with the generated record-level metadata.  During query execution, the 
record-level metadata (i.e., the meta columns) and the data columns are read 
from the same Parquet file, improving read performance.
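
As a rough illustration of choosing the mode per partition within a single 
bootstrap action, here is a minimal Python sketch.  It is not Hudi's 
implementation or API: the names BootstrapMode, select_mode and plan_bootstrap, 
and the regex-based selection rule, are assumptions made up for this example.  
Partitions matching the pattern get the given mode; all others fall back to 
metadata-only, the default mode described above.

```python
import re
from enum import Enum


class BootstrapMode(Enum):
    """Hypothetical labels mirroring the two modes described above."""
    METADATA_ONLY = "METADATA_ONLY"
    FULL_RECORD = "FULL_RECORD"


def select_mode(partition, pattern, matched_mode,
                default_mode=BootstrapMode.METADATA_ONLY):
    """Return matched_mode if the whole partition path matches pattern,
    otherwise the default (metadata-only)."""
    if re.fullmatch(pattern, partition):
        return matched_mode
    return default_mode


def plan_bootstrap(partitions, pattern, matched_mode):
    """Group source partitions by the bootstrap mode selected for each."""
    plan = {mode: [] for mode in BootstrapMode}
    for partition in partitions:
        plan[select_mode(partition, pattern, matched_mode)].append(partition)
    return plan


# Example: rewrite only the latest partition in full; bootstrap the rest
# with metadata only.
source_partitions = ["2022/08/17", "2022/08/18", "2022/08/19"]
plan = plan_bootstrap(source_partitions, r"2022/08/19",
                      BootstrapMode.FULL_RECORD)
for mode, selected in plan.items():
    print(mode.value, selected)
```

A split like this would suit a dataset where only recent partitions are read 
often enough to justify the cost of rewriting the data.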

Phase 1: Testing and verification of the status quo (~1 week)

 

Phase 2: Functionality and correctness fixes (3 weeks)

 

Phase 3: Performance (1~2 weeks)

 

  was:
This is an EPIC to revisit the logic of bootstrap for efficient migration of 
existing non-Hudi dataset, bridging any gaps with new features such as metadata 
table.

 

Phase 1: 

 


> Improving bootstrap and efficient migration of existing non-Hudi dataset
> ------------------------------------------------------------------------
>
>                 Key: HUDI-1265
>                 URL: https://issues.apache.org/jira/browse/HUDI-1265
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: bootstrap
>            Reporter: Balaji Varadarajan
>            Assignee: Ethan Guo
>            Priority: Blocker
>              Labels: hudi-umbrellas
>             Fix For: 0.13.0
>
>
> This is an EPIC to revisit the bootstrap logic for efficient migration of 
> existing non-Hudi datasets, bridging any gaps with new features such as the 
> metadata table.
> Here are the two modes of bootstrap and migration we intend to support:
>  # Onboarding new partitions only: Given an existing non-Hudi partitioned 
> dataset, Hudi manages new partitions under the same table path while keeping 
> non-Hudi partitions untouched in place.  The query engine treats non-Hudi 
> partitions differently when reading the data.  This works well for 
> immutable data, where old partitions receive no updates and new data is 
> only appended to new partitions.
>  # Metadata-only and full-record bootstrap: Given an existing Parquet 
> dataset, Hudi generates the record-level metadata (Hudi meta columns) during 
> the bootstrap process in a new table path, separate from the Parquet 
> dataset.  There are two sub-modes, which can be chosen per partition within 
> a single bootstrap action.  This lets Hudi perform upserts on all 
> partitions.
>  ## Metadata-only: generates only the record-level metadata for each Parquet 
> file, without rewriting the actual data records.  During query execution, 
> the source data is merged with the Hudi metadata to return the results.  
> This is the default mode.
>  ## Full-record: uses bulk insert to copy and rewrite the source data 
> together with the generated record-level metadata.  During query execution, 
> the record-level metadata (i.e., the meta columns) and the data columns are 
> read from the same Parquet file, improving read performance.
> Phase 1: Testing and verification of the status quo (~1 week)
>  
> Phase 2: Functionality and correctness fixes (3 weeks)
>  
> Phase 3: Performance (1~2 weeks)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
