[
https://issues.apache.org/jira/browse/HUDI-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733190#comment-17733190
]
Prashant Wason commented on HUDI-6388:
--------------------------------------
In theory, SparkMetadataTableRecordIndex could simply call getRecordsByKeys()
without having to split the keys into fileGroups itself. In practice this is
limited by driver memory, because getRecordsByKeys() requires a List of keys
rather than an RDD. Also, SparkMetadataTableRecordIndex should not be hashing
the keys itself, as that hashing is an internal implementation detail of the
record index (RI) in the metadata table (MDT).
So the current implementation of SparkMetadataTableRecordIndex is actually a
perf fix: getRecordsByKeys() accepts a List of keys rather than an RDD
(HoodieData), and for large upserts we cannot collect all the keys from the
incoming records onto the driver to pass to it.
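To illustrate the API-shape issue being discussed, here is a minimal, hedged sketch. It uses simplified stand-in types (a plain Map as the index, a Stream as a lazy analogue of HoodieData), not the real Hudi signatures; the point is only the contrast between a List parameter, which forces every key to be materialized on the driver, and a distributed/lazy handle, which does not.

```java
import java.util.*;
import java.util.stream.*;

// Sketch only: simplified stand-ins for Hudi types, not the actual API.
public class RecordIndexSketch {
    // Current shape (simplified): the caller must materialize every key
    // in driver memory as a List before the lookup can start.
    static Map<String, String> getRecordsByKeys(List<String> keys,
                                                Map<String, String> index) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String k : keys) {
            if (index.containsKey(k)) {
                out.put(k, index.get(k));
            }
        }
        return out;
    }

    // Proposed shape (simplified): keys stay in a lazy, partitionable
    // handle, analogous to HoodieData backed by an RDD; nothing is
    // collected onto the driver up front.
    static Stream<Map.Entry<String, String>> getRecordsByKeys(
            Stream<String> keys, Map<String, String> index) {
        return keys.filter(index::containsKey)
                   .map(k -> Map.entry(k, index.get(k)));
    }

    public static void main(String[] args) {
        Map<String, String> index = Map.of("k1", "fg-0", "k2", "fg-1");
        // List-based: all keys exist in driver memory at once.
        System.out.println(getRecordsByKeys(List.of("k1", "k3"), index));
        // Stream-based: keys flow through lazily, mimicking HoodieData.
        getRecordsByKeys(Stream.of("k1", "k3"), index)
            .forEach(e -> System.out.println(e.getKey() + "=" + e.getValue()));
    }
}
```

In real Hudi code the distributed handle would be a HoodieData of keys, so the hashing of keys into metadata-table file groups could stay inside the record index rather than in SparkMetadataTableRecordIndex.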
> BaseTableMetadata::getRecordByKeys should accept HoodieData when a very large
> number of keys are to be looked up.
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-6388
> URL: https://issues.apache.org/jira/browse/HUDI-6388
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Prashant Wason
> Priority: Major
> Labels: release-0.14.0-blocker
>
> This will help simplify the SparkMetadataTableRecordIndex code, which is
> currently hashing the keys into fileGroups itself.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)