[
https://issues.apache.org/jira/browse/HUDI-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Kudinkin updated HUDI-1265:
----------------------------------
Fix Version/s: (was: 0.13.0)
> Efficient bootstrap and migration of existing non-Hudi dataset
> --------------------------------------------------------------
>
> Key: HUDI-1265
> URL: https://issues.apache.org/jira/browse/HUDI-1265
> Project: Apache Hudi
> Issue Type: Epic
> Components: bootstrap
> Reporter: Balaji Varadarajan
> Assignee: Ethan Guo
> Priority: Blocker
> Labels: hudi-umbrellas
>
> This is an EPIC to revisit the logic of bootstrap for efficient migration of
> existing non-Hudi datasets, bridging any gaps with new features such as the
> metadata table.
> Here are the two modes of bootstrap and migration we are supposed to support:
> # Onboard for new partitions alone: Given an existing non-Hudi partitioned
> dataset (/path/parquet), Hudi manages new partitions under the same table
> path (/path/parquet) while keeping non-Hudi partitions untouched in place.
> Query engine treats non-Hudi partitions differently when reading the data.
> This works perfectly for immutable data where there are no updates to old
> partitions and new data is only appended to new partitions.
> # Metadata-only and full-record bootstrap: Given an existing parquet dataset
> (/path/parquet), Hudi generates the record-level metadata (Hudi meta columns)
> during the bootstrap process in a new table path (/path/parquet_hudi)
> different from the parquet dataset. There are two modes; they can be chosen
> at the granularity of partition in a single bootstrap action. This unlocks
> the ability for Hudi to do upsert for all partitions.
> ## Metadata-only: generates record-level metadata only per parquet file and
> a bootstrap index for mapping, without rewriting the actual data records.
> During query execution, the source data is merged with Hudi metadata to
> return the results. This is the default mode.
> ## Full-record: uses bulk insert to generate record-level metadata and to
> copy over and rewrite the source data. During query execution, record-level
> metadata, i.e., meta columns, and the data columns are read from the same
> parquet file, improving read performance.
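The per-partition choice between the two bootstrap modes can be sketched in
plain Python. This is only an illustration of the selection rule described
above (metadata-only by default, full-record where requested), not Hudi's
implementation. The names BootstrapMode, select_bootstrap_modes and the regex
parameter are hypothetical.

```python
import re
from enum import Enum


class BootstrapMode(Enum):
    """The two bootstrap modes named in the description."""
    METADATA_ONLY = "METADATA_ONLY"
    FULL_RECORD = "FULL_RECORD"


def select_bootstrap_modes(partitions, full_record_pattern=None):
    """Map each partition path to a bootstrap mode.

    Hypothetical helper: partitions whose path fully matches
    `full_record_pattern` are rewritten (full-record); every other
    partition falls back to the default, metadata-only.
    """
    regex = re.compile(full_record_pattern) if full_record_pattern else None
    modes = {}
    for partition in partitions:
        if regex is not None and regex.fullmatch(partition):
            modes[partition] = BootstrapMode.FULL_RECORD
        else:
            modes[partition] = BootstrapMode.METADATA_ONLY
    return modes
```

For example, select_bootstrap_modes(["2020/01/01", "2022/06/01"], r"2022/.*")
would rewrite only the 2022 partition and leave the 2020 partition in place
with metadata-only bootstrap.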
> Important requirements:
> * Query engine integration: Spark, Hive, Presto/Trino
> * COW more important than MOR
> * Address performance degradation due to treating the entire table as
> bootstrap
> * Metadata table integration
> * Support source dataset with Hive-style partitioning
> * Support of non-Hudi partitions
> Phase 1: Testing and verification of status-quo (1~1.5 weeks)
> Writing:
> * Two migration modes above
> * COW and MOR
> * 1 additional commit after bootstrap doing upsert for metadata-only and
> full-record bootstrap
> * Spark datasource, Deltastreamer
> * Partitioned and non-partitioned table
> * Simple/complex key gen
> * Hive-style partition
> * w/ and w/o metadata table enabled
> * Meta sync
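One way to keep track of the writing combinations listed above is to
enumerate them programmatically. The sketch below is an assumption about how
such a matrix could be built; the dimension names and values are hypothetical
labels for the bullets above, and the rule that drops Hive-style partitioning
for non-partitioned tables is also an assumption. The extra upsert commit and
meta sync are left out because they are steps rather than dimensions.

```python
from itertools import product

# Hypothetical labels for the writing dimensions listed above.
DIMENSIONS = {
    "mode": ["new_partitions_only", "metadata_only", "full_record"],
    "table_type": ["COW", "MOR"],
    "writer": ["spark_datasource", "deltastreamer"],
    "partitioned": [True, False],
    "key_gen": ["simple", "complex"],
    "hive_style": [True, False],
    "metadata_table": [True, False],
}


def build_write_test_matrix(dimensions=DIMENSIONS):
    """Return every combination of the dimensions as a list of dicts."""
    names = list(dimensions)
    cases = []
    for values in product(*dimensions.values()):
        case = dict(zip(names, values))
        # Assumption: Hive-style partitioning is meaningless without partitions.
        if case.get("hive_style") and not case.get("partitioned"):
            continue
        cases.append(case)
    return cases
```

Each returned dict can then drive one parameterized bootstrap test run.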
> Reading:
> * Hive QL, Spark SQL, Spark datasource, Presto/Trino
> * Snapshot, read-optimized, incremental query
> * Queries in the original query testing plan:
> [https://docs.google.com/spreadsheets/d/1xVfatk-6-fekwuCCZ-nTHQkewcHSEk89y-ReVV5vHQU/edit#gid=1813901684]
> Need to develop a validation tool for automated validation
> * Metadata, i.e., meta columns and index in metadata table, is properly
> populated
> * Data queried from Hudi table matches the parquet data
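The two validation checks above can be sketched on in-memory rows. This is a
minimal illustration, not the validation tool itself: rows are assumed to be
plain dicts already read from the parquet source and from the Hudi table, the
function names are hypothetical, and the index in the metadata table is not
checked here. The five column names are the standard Hudi meta columns.

```python
HOODIE_META_COLUMNS = (
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
)


def strip_meta_columns(row):
    """Return the row without Hudi meta columns (data columns only)."""
    return {k: v for k, v in row.items() if k not in HOODIE_META_COLUMNS}


def validate_bootstrap(source_rows, hudi_rows, key_field):
    """Return a list of human-readable problems; empty means both checks pass."""
    problems = []
    # Check 1: every Hudi row has all meta columns populated.
    for row in hudi_rows:
        empty = [c for c in HOODIE_META_COLUMNS if row.get(c) in (None, "")]
        if empty:
            problems.append(f"{row.get(key_field)!r}: unpopulated meta columns {empty}")
    # Check 2: data columns match the source, compared by record key.
    source_by_key = {row[key_field]: row for row in source_rows}
    hudi_by_key = {row[key_field]: strip_meta_columns(row) for row in hudi_rows}
    for key in sorted(source_by_key.keys() - hudi_by_key.keys()):
        problems.append(f"{key!r}: missing from Hudi table")
    for key in sorted(hudi_by_key.keys() - source_by_key.keys()):
        problems.append(f"{key!r}: not present in source")
    for key in sorted(source_by_key.keys() & hudi_by_key.keys()):
        if source_by_key[key] != hudi_by_key[key]:
            problems.append(f"{key!r}: data columns differ")
    return problems
```

A real tool would read both sides through a query engine and compare at
scale; the comparison rule would stay the same.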
> Add tests when needed
> * HUDI-4125 Add integration tests around bootstrapped Hudi table
> Phase 2: Functionality and correctness fixes (2~3 weeks)
> Known and possible issues:
> * Spark cannot see non-Hudi partitions in the first onboarding mode
> * Bootstrap Relation does not support MOR; HUDI-2071 Support Reading
> Bootstrap MOR RT Table In Spark DataSource Table
> * HUDI-915 Partition Columns missing in files upserted after Metadata
> Bootstrap
> * HUDI-992 For hive-style partitioned source data, partition columns synced
> with Hive will always have String type
> * HUDI-1369 Bootstrap Datasource jobs from hanging via spark-submit
> * HUDI-3122 Presto query failed for bootstrap tables
> * HUDI-1779 Fail to bootstrap/upsert a table which contains timestamp column
> Phase 3: Performance (1~2 weeks)
> * HUDI-1157 Optimization whether to query Bootstrapped table using
> HoodieBootstrapRelation vs Sparks Parquet datasource
> * HUDI-4453 Support partition pruning for tables Bootstrapped from Source
> Hive Style partitioned tables
> * HUDI-619 Avoid stitching meta columns and only load data columns for
> improving read performance
> * HUDI-1158 Optimizations in parallelized listing behaviour for markers and
> bootstrap source files
>
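To make the Hive-style partitioning items concrete, the sketch below parses
Hive-style partition paths (column=value segments) and prunes them with
equality filters. It is a simplified illustration of the idea behind
partition pruning, not the HUDI-4453 implementation. Both function names are
hypothetical, and escaped characters in partition values are not handled.

```python
def parse_hive_partition_path(path):
    """Split 'year=2022/month=06' into {'year': '2022', 'month': '06'}.

    Values stay strings because the path itself carries no type information.
    """
    values = {}
    for segment in path.strip("/").split("/"):
        name, sep, value = segment.partition("=")
        if not sep or not name:
            raise ValueError(f"not a Hive-style partition segment: {segment!r}")
        values[name] = value
    return values


def prune_partitions(paths, filters):
    """Keep only the partition paths matching every column == value filter."""
    kept = []
    for path in paths:
        values = parse_hive_partition_path(path)
        # Filter values are compared as strings, like the parsed path values.
        if all(values.get(col) == str(val) for col, val in filters.items()):
            kept.append(path)
    return kept
```

Because the comparison is on strings, a filter such as {"month": 6} would not
match a path segment month=06; a typed comparison would need the source
schema.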