[
https://issues.apache.org/jira/browse/HUDI-6388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733190#comment-17733190
]
Prashant Wason commented on HUDI-6388:
--------------------------------------
In theory, SparkMetadataTableRecordIndex could simply call getRecordsByKeys()
without having to split the keys into fileGroups itself. In practice this is
limited by driver memory, because getRecordsByKeys() requires a List of keys
rather than an RDD. Also, SparkMetadataTableRecordIndex should not be hashing
the keys itself, as that hashing is an internal implementation detail of the
record index (RI) in the metadata table (MDT).
So the current implementation of SparkMetadataTableRecordIndex is actually a
perf fix: getRecordsByKeys() accepts a List of keys rather than an RDD
(HoodieData), and for large upserts we cannot collect all the keys from the
incoming records onto the driver to pass to it.
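To illustrate the API-shape issue being discussed, here is a minimal, hedged sketch. It uses simplified stand-in types (a plain Map as the index, a Stream as a lazy analogue of HoodieData), not the real Hudi signatures; the point is only the contrast between a List parameter, which forces every key to be materialized on the driver, and a distributed/lazy handle, which does not.

```java
import java.util.*;
import java.util.stream.*;

// Sketch only: simplified stand-ins for Hudi types, not the actual API.
public class RecordIndexSketch {
    // Current shape (simplified): the caller must materialize every key
    // in driver memory as a List before the lookup can start.
    static Map<String, String> getRecordsByKeys(List<String> keys,
                                                Map<String, String> index) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String k : keys) {
            if (index.containsKey(k)) {
                out.put(k, index.get(k));
            }
        }
        return out;
    }

    // Proposed shape (simplified): keys stay in a lazy, partitionable
    // handle, analogous to HoodieData backed by an RDD; nothing is
    // collected onto the driver up front.
    static Stream<Map.Entry<String, String>> getRecordsByKeys(
            Stream<String> keys, Map<String, String> index) {
        return keys.filter(index::containsKey)
                   .map(k -> Map.entry(k, index.get(k)));
    }

    public static void main(String[] args) {
        Map<String, String> index = Map.of("k1", "fg-0", "k2", "fg-1");
        // List-based: all keys exist in driver memory at once.
        System.out.println(getRecordsByKeys(List.of("k1", "k3"), index));
        // Stream-based: keys flow through lazily, mimicking HoodieData.
        getRecordsByKeys(Stream.of("k1", "k3"), index)
            .forEach(e -> System.out.println(e.getKey() + "=" + e.getValue()));
    }
}
```

In real Hudi code the distributed handle would be a HoodieData of keys, so the hashing of keys into metadata-table file groups could stay inside the record index rather than in SparkMetadataTableRecordIndex.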
> BaseTableMetadata::getRecordByKeys should accept HoodieData when a very large
> number of keys are to be looked up.
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-6388
> URL: https://issues.apache.org/jira/browse/HUDI-6388
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Prashant Wason
> Priority: Major
> Labels: release-0.14.0-blocker
>
> This will help simplify the SparkMetadataTableRecordIndex code, which is
> currently hashing the keys into fileGroups itself.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)