Prashant Wason created HUDI-6098:
------------------------------------
Summary: Initial commit in MDT should use bulk insert for
performance
Key: HUDI-6098
URL: https://issues.apache.org/jira/browse/HUDI-6098
Project: Apache Hudi
Issue Type: Improvement
Reporter: Prashant Wason
Assignee: Prashant Wason
The initial commit into the MDT (metadata table) writes a very large number of
records. With indexes like record-index (to be committed), the number of records
written is on the order of the total number of records in the dataset itself
(possibly billions).
If we use upsertPrepped to initialize the indexes, then:
# The initial commit writes its data into log files.
# Because of the large amount of data, the write is split into a very large
number of log blocks.
# Lookups from the MDT perform poorly until a compaction
is run.
# Compaction then reads all the log data and rewrites it into base files (HFiles),
doubling the read/write IO.
By writing the initial commit directly into base files using the
bulkInsertPrepped API, we can avoid all of the issues listed above.
This is a critical requirement for large-scale indexes like the record index.
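To make the IO argument above concrete, here is a rough back-of-the-envelope sketch in Java. It does not call any Hudi API: the class MdtInitCostSketch, its methods, and the maxLogBlockBytes parameter are hypothetical names used only for illustration. It also assumes that the base files end up the same size as the log data, ignoring compression and encoding differences.

```java
// Back-of-the-envelope cost model only; it does not use or imitate any Hudi API.
// All names here are hypothetical and exist purely for illustration.
final class MdtInitCostSketch {

    /** Bytes written/read and log blocks produced to reach a base-file state. */
    static final class Cost {
        final long bytesWritten;
        final long bytesRead;
        final long logBlocks;

        Cost(long bytesWritten, long bytesRead, long logBlocks) {
            this.bytesWritten = bytesWritten;
            this.bytesRead = bytesRead;
            this.logBlocks = logBlocks;
        }
    }

    /**
     * Upsert path: the initial commit lands in log blocks, then compaction
     * reads every log block back and rewrites the data into base files.
     * Assumes base files are the same size as the log data.
     */
    static Cost upsertThenCompact(long totalBytes, long maxLogBlockBytes) {
        if (totalBytes < 0 || maxLogBlockBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // Ceiling division without risking overflow on the addition.
        long blocks = totalBytes / maxLogBlockBytes
                + (totalBytes % maxLogBlockBytes == 0 ? 0 : 1);
        long written = Math.multiplyExact(2L, totalBytes); // log write + base-file write
        long read = totalBytes;                            // compaction reads the logs
        return new Cost(written, read, blocks);
    }

    /** Bulk-insert path: data is written once, directly into base files. */
    static Cost bulkInsert(long totalBytes) {
        if (totalBytes < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        return new Cost(totalBytes, 0L, 0L);
    }
}
```

Under these assumptions, the upsert path writes the data twice and reads it once before lookups hit base files, while the bulk-insert path writes it once and produces no log blocks. Real numbers will differ with compression, file sizing, and parallelism.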