[
https://issues.apache.org/jira/browse/HUDI-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-6098:
---------------------------------
Labels: pull-request-available (was: )
> Initial commit in MDT should use bulk insert for performance
> ------------------------------------------------------------
>
> Key: HUDI-6098
> URL: https://issues.apache.org/jira/browse/HUDI-6098
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
> Labels: pull-request-available
>
> The initial commit into the MDT writes a very large number of records. With
> indexes like the record index (to be committed), the number of written records
> is on the order of the total number of records in the dataset itself
> (potentially billions).
> If we use upsertPrepped to initialize the indexes, then:
> # The initial commit will write data into log files.
> # Due to the large amount of data, the write will be split into a very large
> number of log blocks.
> # Performance of lookups from the MDT will suffer greatly until a compaction
> is run.
> # Compaction will then read all the log data and rewrite it into base files
> (HFiles), doubling the read/write IO.
> By writing the initial commit directly into base files using the
> bulkInsertPrepped API, we can avoid all of the issues listed above.
> This is a critical requirement for large-scale indexes such as the record index.
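The IO doubling described in the issue can be illustrated with back-of-the-envelope arithmetic. This is a sketch only; the payload size is hypothetical, and it models just the byte traffic of the two paths, not the actual Hudi write client calls:

```python
# Hypothetical total size of the index records written by the initial commit.
payload = 1_000_000_000_000  # ~1 TB (illustrative figure only)

# upsertPrepped path: records land in log files first; a later compaction
# must read all the log data back and rewrite it into base files (HFiles).
upsert_writes = payload + payload  # log-file write + compacted base-file write
upsert_reads = payload             # compaction reads the log data back

# bulkInsertPrepped path: one direct write into base files, nothing to compact.
bulk_writes = payload
bulk_reads = 0

print(upsert_writes // bulk_writes)                      # write IO is doubled
print((upsert_writes + upsert_reads) // bulk_writes)     # total IO is tripled
```

This also ignores the lookup-latency cost of serving reads from many small log blocks before compaction runs, which the issue calls out separately.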
--
This message was sent by Atlassian Jira
(v8.20.10#820010)