Alexey Kudinkin created HUDI-5534:
-------------------------------------

             Summary: Optimize Bloom Index lookup DAG
                 Key: HUDI-5534
                 URL: https://issues.apache.org/jira/browse/HUDI-5534
             Project: Apache Hudi
          Issue Type: Improvement
          Components: writer-core
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin
             Fix For: 0.13.0


There are some low-hanging optimizations that could considerably improve the 
performance of the Bloom Index lookup sequence:
 # Map file-comparison pairs to a PairRDD (where the key is the file name and 
the value is the record key) instead of a plain RDD. This would allow us to:
 ## Sort by file name (to make sure we check all records within a file at 
once) within a single Spark partition instead of globally (reducing 
shuffling as well)
 ## Avoid the re-shuffling caused by re-mapping from RDD to PairRDD later
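
The idea behind the list above can be sketched without Spark. The following is a minimal, stdlib-only simulation, not Hudi code: the class name BloomLookupSketch and the helpers pair and partitionAndSortByFile are hypothetical and exist only for illustration. It routes each (file name, record key) pair to a partition by hashing the file name, then sorts each partition locally, so all record keys for a given file end up adjacent in one partition without a global sort. In Spark, the closest built-in equivalent is probably JavaPairRDD.repartitionAndSortWithinPartitions, though whether the actual change uses it is an assumption here.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class BloomLookupSketch {

    // Hypothetical helper: builds one (fileName, recordKey) comparison pair.
    static Map.Entry<String, String> pair(String fileName, String recordKey) {
        return new AbstractMap.SimpleImmutableEntry<>(fileName, recordKey);
    }

    // Hypothetical helper: hash-partitions pairs by file name, then sorts each
    // partition locally by file name. No ordering is imposed across partitions.
    static List<List<Map.Entry<String, String>>> partitionAndSortByFile(
            List<Map.Entry<String, String>> pairs, int numPartitions) {
        List<List<Map.Entry<String, String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        // The same file name always hashes to the same partition.
        for (Map.Entry<String, String> p : pairs) {
            int idx = Math.floorMod(p.getKey().hashCode(), numPartitions);
            partitions.get(idx).add(p);
        }
        // Local (per-partition) sort; List.sort is stable, so record keys of
        // the same file keep their input order.
        for (List<Map.Entry<String, String>> partition : partitions) {
            partition.sort(Map.Entry.comparingByKey());
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = Arrays.asList(
                pair("file-b", "key-3"),
                pair("file-a", "key-1"),
                pair("file-b", "key-2"),
                pair("file-a", "key-4"));
        List<List<Map.Entry<String, String>>> result = partitionAndSortByFile(pairs, 2);
        for (int i = 0; i < result.size(); i++) {
            System.out.println("partition " + i + ": " + result.get(i));
        }
    }
}
```

Because the data is already keyed by file name at this point, a later per-file step would not need to re-key it, which is the re-shuffle that the second sub-item aims to avoid.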



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
