[
https://issues.apache.org/jira/browse/HUDI-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-2285:
---------------------------------
Parent: HUDI-1292
Issue Type: Sub-task (was: Task)
> Metadata Table Synchronous Design
> ---------------------------------
>
> Key: HUDI-2285
> URL: https://issues.apache.org/jira/browse/HUDI-2285
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
> Labels: pull-request-available
>
> h2. *Motivation*
> HUDI Metadata Table version 1 (0.7 release) supports file-listing
> optimization. We intend to add support for additional information - a
> record-level index (UUID) and column indexes (column range indexes) - to the
> metadata table. This requires re-architecting the table design for large
> scale (50 billion+ records), for synchronous operations, and to reduce
> reader-side overhead.
> # Limit the amount of syncing required on the reader side
> ## Syncing on the reader side may negate the benefits of the secondary index
> ## Not syncing on the reader side simplifies the design and reduces testing
> # Allow moving to a multi-writer design with operations running in separate
> pipelines
> ## E.g. Clustering / Clean / Backfills in separate pipelines
> ## Ease of debugging
> # Scale - should be able to handle very large inserts - millions of keys,
> thousands of datafiles written
>
> h3. *Writer Side*
> The lifecycle of a HUDI operation is listed below. The example shows a
> COMMIT operation, but the steps apply to all types of operations.
> # SparkHoodieWriteClient.commit(...) is called by the ingestion process at
> time T1
> # Create the requested instant t1.commit.requested
> # Create the inflight instant t1.commit.inflight
> # Perform the write of the RDD into the dataset and create the
> HoodieCommitMetadata
> # HoodieMetadataTableWriter.update(CommitMetadata, t1, WriteStatus)
> ## This performs a delta-commit on the HUDI Metadata Table, updating the
> file listing, record-level index (future) and column indexes (future)
> together from the data collected in the WriteStatus.
> ## This delta-commit completes before the commit started on the dataset
> completes.
> ## This creates the t1.deltacommit on the Metadata Table.
> ## Since the Metadata Table has inline clean and inline compaction, those
> additional operations may also take place at this time.
> # Complete the commit by creating t1.commit
> Inline compaction will only compact those log blocks which can be deemed
> readable per the reader-side algorithm described in the next section.
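> The writer-side ordering above can be sketched as a timeline of instant-state
> transitions. This is a minimal illustration only; the class and method names
> below are hypothetical stand-ins, not the actual Hudi APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the commit lifecycle: the metadata table
// delta-commit runs before the dataset commit completes, so readers
// never see a completed commit whose metadata is missing.
public class CommitLifecycleSketch {
    public static final List<String> timeline = new ArrayList<>();

    // Dataset-side commit at a given instant time.
    public static void commit(String instant) {
        timeline.add(instant + ".commit.requested");  // step 2
        timeline.add(instant + ".commit.inflight");   // step 3
        // step 4: write the RDD, build HoodieCommitMetadata (elided)
        updateMetadataTable(instant);                 // step 5
        timeline.add(instant + ".commit");            // step 6
    }

    // Applies the file listing (and, in future, index) updates as a
    // delta-commit on the metadata table; inline clean/compaction on
    // the metadata table may also run at this point.
    public static void updateMetadataTable(String instant) {
        timeline.add(instant + ".deltacommit");
    }

    public static void main(String[] args) {
        commit("t1");
        System.out.println(timeline);
    }
}
```

> The invariant to note is that t1.deltacommit always lands on the timeline
> before t1.commit.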
> h3. *Reader Side*
> # List the dataset to find all completed instants - e.g. t1.commit,
> t2.commit … t10.commit
> # Since these instants are completed, their related metadata has already
> been written to the metadata table as part of the respective deltacommits -
> t1.deltacommit, t2.deltacommit … t10.deltacommit
> # Find the last completed instant on the dataset - t10.commit
> # Open the FileSlice on the metadata partition with the following
> constraints:
> ## Any base file with time > t10 cannot be used
> ## Any log blocks whose timestamp is not in the list of completed instants
> (#1 above) cannot be used
> # Only in ingestion failure cases may the latest base file (created due to
> compaction) or some log blocks have to be ignored. In success cases, this
> process adds no extra overhead beyond listing the dataset.
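> The two reader-side visibility constraints can be sketched as simple
> predicates. Instant times are modeled here as zero-padded strings so that
> lexicographic order matches time order; the names are illustrative, not
> Hudi code.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the reader-side filtering rules for the metadata FileSlice.
public class MetadataReaderSketch {
    // A base file on the metadata partition is usable only if its instant
    // time is <= the last completed instant on the dataset (e.g. t10).
    public static boolean baseFileUsable(String baseInstant, String lastCompleted) {
        return baseInstant.compareTo(lastCompleted) <= 0;
    }

    // A log block is usable only if its timestamp belongs to a completed
    // instant found by listing the dataset timeline.
    public static List<String> usableLogBlocks(List<String> logBlockTimes, Set<String> completed) {
        return logBlockTimes.stream().filter(completed::contains).collect(Collectors.toList());
    }
}
```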
>
> h3. *Multi Write Support*
> Since each operation on metadata table writes to the same files (file-listing
> partition has a single FileSlice), we can only allow single-writer access to
> the metadata table. For this, the Transaction Manager is used to lock the
> table before any updates.
> In essence, each multi-writer operation will contend for the same lock to
> write updates to the metadata table before the operation completes. This may
> not even be an issue in reality as the operations will complete at different
> times and the metadata table operations should be fast.
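> The single-writer serialization described above can be sketched as follows.
> A local ReentrantLock stands in for Hudi's Transaction Manager purely for
> illustration; in a real multi-writer deployment an external lock provider
> would be used.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of single-writer access to the metadata table: every writer
// contends for the same table-level lock before applying its update.
public class MetadataLockSketch {
    private static final ReentrantLock tableLock = new ReentrantLock();
    public static int appliedUpdates = 0;

    // Acquire the lock, apply the metadata table update (a delta-commit),
    // then release so the next writer can proceed.
    public static void applyMetadataUpdate(Runnable update) {
        tableLock.lock();
        try {
            update.run();
            appliedUpdates++;
        } finally {
            tableLock.unlock();
        }
    }
}
```

> Since operations in separate pipelines typically complete at different
> times and the metadata update itself is fast, contention on this lock
> should be brief.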
>
> h3. *Upgrade/Downgrade*
> The two versions (the current one and this new one) differ in schema, and
> it's complicated to check whether the table is in sync. It is simpler to
> re-bootstrap, since only the file listing needs to be re-bootstrapped.
> h3. *Support for shards in metadata table partitions*
> # There will be a fixed number of shards for each Metadata Table partition.
> # Shards are implemented using filenames of the format fileId00ABCD, where
> ABCD is the shard number. This allows easy identification of the files and
> their order while still keeping the names unique.
> # Shards are pre-allocated at the time of bootstrap.
> # Currently only the files partition has 1 shard, but this code is required
> for the record-level index, so it is implemented here.
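> The shard naming scheme can be sketched as below: the file name carries a
> fixed-width shard number as its suffix. The helper names and the
> key-to-shard hash are illustrative assumptions, not Hudi code.

```java
// Sketch of shard naming (format fileId00ABCD, where ABCD is the shard
// number) and of routing a record key to a pre-allocated shard.
public class ShardSketch {
    public static final int SHARD_DIGITS = 4;

    // Recover the shard number from the trailing digits of the file id,
    // e.g. "fileId000012" is shard 12.
    public static int shardOf(String fileId) {
        return Integer.parseInt(fileId.substring(fileId.length() - SHARD_DIGITS));
    }

    // Route a record key to one of a fixed number of shards; with a single
    // shard (the files partition today) every key maps to shard 0.
    public static int shardForKey(String key, int numShards) {
        return Math.floorMod(key.hashCode(), numShards);
    }
}
```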
--
This message was sent by Atlassian Jira
(v8.3.4#803005)