Prashant Wason created HUDI-6098:
------------------------------------

             Summary: Initial commit in MDT should use bulk insert for 
performance
                 Key: HUDI-6098
                 URL: https://issues.apache.org/jira/browse/HUDI-6098
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason
            Assignee: Prashant Wason


The initial commit into the metadata table (MDT) writes a very large number 
of records. With indexes like the record index (to be committed), the number 
of records written is on the order of the total number of records in the 
dataset itself (possibly billions). 

If we use upsertPrepped to initialize the indexes, then:
 # The initial commit will write its data into log files.
 # Due to the large amount of data, the write will be split into a very large 
number of log blocks.
 # Performance of lookups from the MDT will suffer greatly until a compaction 
is run.
 # Compaction will read all the log data and rewrite it into base files 
(HFiles), doubling the read/write IO.

By writing the initial commit directly into base files using the 
bulkInsertPrepped API, we can avoid all the issues listed above.
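
To make the IO argument concrete, here is a rough back-of-the-envelope model of 
the two paths. It is only a sketch, not Hudi code: the function names, the 
IoCost fields, and the records-per-log-block figure are all hypothetical, and it 
counts record reads and writes rather than bytes.

```python
import math
from dataclasses import dataclass


@dataclass
class IoCost:
    records_written: int  # total record writes across all steps
    records_read: int     # record reads performed by compaction
    log_blocks: int       # log blocks readers must merge until compaction runs


def upsert_path(num_records: int, records_per_log_block: int) -> IoCost:
    """Toy model: initial commit goes to log files, then compaction rewrites it."""
    # Assumption: a fixed, hypothetical number of records fits in one log block.
    log_blocks = math.ceil(num_records / records_per_log_block)
    # Every record is written once to logs, read back by compaction,
    # and written a second time into base files.
    return IoCost(
        records_written=2 * num_records,
        records_read=num_records,
        log_blocks=log_blocks,
    )


def bulk_insert_path(num_records: int) -> IoCost:
    """Toy model: initial commit is written straight into base files."""
    # One write per record, no log blocks, no compaction needed for this commit.
    return IoCost(records_written=num_records, records_read=0, log_blocks=0)


if __name__ == "__main__":
    n = 1_000_000_000  # a dataset with a billion records
    print("upsert:     ", upsert_path(n, records_per_log_block=100_000))
    print("bulk insert:", bulk_insert_path(n))
```

Under these assumed numbers the upsert path writes every record twice and 
leaves thousands of log blocks to merge on lookup until compaction finishes, 
while the bulk insert path writes every record once.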

This is a critical requirement for large-scale indexes like the record index.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
