hudi-bot opened a new issue, #16029:
URL: https://github.com/apache/hudi/issues/16029

   This will help simplify the SparkMetadataTableRecordIndex code, which is currently hashing 
the keys into fileGroups itself.
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-6388
   - Type: Improvement
   
   
   ---
   
   
   ## Comments
   
   **15/Jun/23 18:00 (pwason):** In theory, SparkMetadataTableRecordIndex could simply 
call getRecordsByKeys() without having to split the keys into fileGroups. However, that is 
limited by driver memory, because getRecordsByKeys() requires a list of keys rather than 
an RDD. Also, SparkMetadataTableRecordIndex should not be hashing the keys, as that is an 
internal implementation detail of the record index (RI) in the metadata table (MDT).
   
   So the current implementation of SparkMetadataTableRecordIndex is actually a 
perf fix, since getRecordsByKeys() accepts a list of keys rather than an RDD 
(HoodieData). For large upserts, we cannot collect all the keys from incoming 
records onto the driver to pass to SparkMetadataTableRecordIndex.
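To make the discussion concrete, here is a minimal sketch of what "hashing the keys into fileGroups" means: each record key is mapped deterministically to one of N metadata-table file groups, so lookups for a batch of keys can be routed to the right file group. This is an illustrative stand-in, not Hudi's actual hash function; the class and method names (`FileGroupHashing`, `keyToFileGroupIndex`, `bucketKeys`) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FileGroupHashing {

    // Hypothetical stand-in for the key-to-fileGroup mapping that
    // SparkMetadataTableRecordIndex currently performs itself. In Hudi this
    // logic belongs inside the MDT record index, not in the caller.
    static int keyToFileGroupIndex(String recordKey, int numFileGroups) {
        // floorMod keeps the bucket index non-negative even for negative hashes.
        return Math.floorMod(recordKey.hashCode(), numFileGroups);
    }

    // Group a batch of record keys by their target file group, so each
    // file group's base file need only be probed for its own keys.
    static Map<Integer, List<String>> bucketKeys(List<String> keys, int numFileGroups) {
        return keys.stream()
                .collect(Collectors.groupingBy(k -> keyToFileGroupIndex(k, numFileGroups)));
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("key-1", "key-2", "key-3", "key-4");
        Map<Integer, List<String>> buckets = bucketKeys(keys, 4);
        buckets.forEach((fg, ks) -> System.out.println("fileGroup " + fg + " -> " + ks));
    }
}
```

The mapping must be deterministic and stable across callers, which is exactly why the comment argues it should live inside the MDT record index rather than be re-implemented by SparkMetadataTableRecordIndex.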


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
