[
https://issues.apache.org/jira/browse/HUDI-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-6098:
---------------------------------
Labels: pull-request-available (was: )
> Initial commit in MDT should use bulk insert for performance
> ------------------------------------------------------------
>
> Key: HUDI-6098
> URL: https://issues.apache.org/jira/browse/HUDI-6098
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
> Labels: pull-request-available
>
> The initial commit into the MDT writes a very large number of records. With
> indexes like the record index (to be committed), the number of written records
> is on the order of the total number of records in the dataset itself
> (potentially billions).
> If we use upsertPrepped to initialize the indexes, then:
> # The initial commit will write data into log files.
> # Due to the large amount of data, the write will be split into a very large
> number of log blocks.
> # Performance of lookups from the MDT will suffer greatly until a compaction
> is run.
> # Compaction will then read all the log data and rewrite it into base files
> (HFiles), doubling the read/write IO.
> By writing the initial commit directly into base files using the
> bulkInsertPrepped API, we can avoid all of the issues listed above.
> This is a critical requirement for large-scale indexes such as the record index.
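The IO doubling described in the issue can be illustrated with back-of-the-envelope arithmetic. This is a sketch only; the payload size is hypothetical, and it models just the byte traffic of the two paths, not the actual Hudi write client calls:

```python
# Hypothetical total size of the index records written by the initial commit.
payload = 1_000_000_000_000  # ~1 TB (illustrative figure only)

# upsertPrepped path: records land in log files first; a later compaction
# must read all the log data back and rewrite it into base files (HFiles).
upsert_writes = payload + payload  # log-file write + compacted base-file write
upsert_reads = payload             # compaction reads the log data back

# bulkInsertPrepped path: one direct write into base files, nothing to compact.
bulk_writes = payload
bulk_reads = 0

print(upsert_writes // bulk_writes)                      # write IO is doubled
print((upsert_writes + upsert_reads) // bulk_writes)     # total IO is tripled
```

This also ignores the lookup-latency cost of serving reads from many small log blocks before compaction runs, which the issue calls out separately.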
--
This message was sent by Atlassian Jira
(v8.20.10#820010)