hudi-bot opened a new issue, #16029: URL: https://github.com/apache/hudi/issues/16029
This will help simplify `SparkMetadataTableRecordIndex`, which currently hashes the keys into file groups itself.

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-6388
- Type: Improvement

---

## Comments

**15/Jun/23 18:00 — pwason:**

In theory, `SparkMetadataTableRecordIndex` could simply call `getRecordsByKeys()` without having to split the keys into file groups. But this is limited by driver memory, because `getRecordsByKeys()` requires a list of keys rather than an RDD. Also, `SparkMetadataTableRecordIndex` should not be hashing the keys, as that is an internal implementation detail of the record index (RI) in the metadata table (MDT). So the current implementation of `SparkMetadataTableRecordIndex` is actually a performance fix, since `getRecordsByKeys()` accepts a list of keys rather than an RDD (`HoodieData`). For large upserts, we cannot collect all the keys from the incoming records onto the driver to pass to `SparkMetadataTableRecordIndex`.
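To illustrate the kind of key-to-file-group hashing discussed above, here is a minimal sketch of deterministic bucketing of record keys into a fixed number of file groups. The class and method names are hypothetical, for illustration only; they are not Hudi's actual record-index API, which uses its own internal hashing scheme.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: bucket record keys into file groups by hash,
// so each file group's lookup can be executed independently (e.g. per
// Spark partition) instead of collecting all keys onto the driver.
public class FileGroupBucketing {

  // Map a record key to one of numFileGroups buckets deterministically.
  static int fileGroupFor(String recordKey, int numFileGroups) {
    // Mask the sign bit rather than using Math.abs, which overflows
    // for Integer.MIN_VALUE.
    return (recordKey.hashCode() & Integer.MAX_VALUE) % numFileGroups;
  }

  // Group keys by their target file group.
  static Map<Integer, List<String>> bucketKeys(List<String> keys, int numFileGroups) {
    Map<Integer, List<String>> buckets = new HashMap<>();
    for (String key : keys) {
      buckets.computeIfAbsent(fileGroupFor(key, numFileGroups), g -> new ArrayList<>())
             .add(key);
    }
    return buckets;
  }

  public static void main(String[] args) {
    Map<Integer, List<String>> buckets =
        bucketKeys(List.of("k1", "k2", "k3", "k4"), 4);
    // Every key lands in exactly one bucket.
    int total = buckets.values().stream().mapToInt(List::size).sum();
    System.out.println(total);
  }
}
```

The point of the proposed change is that this bucketing should live inside the metadata-table record index rather than in `SparkMetadataTableRecordIndex` itself; the caller would then pass keys (ideally as distributed `HoodieData` rather than a driver-side list) and let the index route them.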
