Alexey Kudinkin created HUDI-5534:
-------------------------------------
Summary: Optimize Bloom Index lookup DAG
Key: HUDI-5534
URL: https://issues.apache.org/jira/browse/HUDI-5534
Project: Apache Hudi
Issue Type: Improvement
Components: writer-core
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
Fix For: 0.13.0
A few low-hanging optimizations could considerably improve the performance
of the Bloom Index lookup sequence:
# Map file-comparison pairs to a PairRDD (where the key is the file name and
the value is the record key) instead of a plain RDD. This would allow us to:
## Sort by file name (to make sure all records within a file are checked at
once) within a single Spark partition instead of globally, which also reduces
shuffling
## Avoid the re-shuffle caused by re-mapping from RDD to PairRDD later
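
The idea above can be sketched without Spark, using plain Java collections. This is only an illustration under stated assumptions, not the actual Hudi change: the class BloomLookupSketch and its methods are hypothetical names, and each list stands in for one Spark partition. Pairs are keyed by file name, hash-partitioned on that key alone, and then sorted locally inside each partition. In Spark, a partition-then-local-sort step like this roughly corresponds to JavaPairRDD.repartitionAndSortWithinPartitions, though the issue does not say which API the change will use.

```java
import java.util.*;

// Hypothetical sketch: each inner list plays the role of one Spark partition.
final class BloomLookupSketch {

    // Partition (fileName, recordKey) pairs by file name only, then sort each
    // partition locally by file name and record key. No global sort is needed:
    // all candidate keys for a given file land in the same partition.
    static List<List<Map.Entry<String, String>>> partitionAndSortByFile(
            List<Map.Entry<String, String>> pairs, int numPartitions) {
        List<List<Map.Entry<String, String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, String> pair : pairs) {
            // floorMod keeps the index non-negative for negative hash codes.
            int idx = Math.floorMod(pair.getKey().hashCode(), numPartitions);
            partitions.get(idx).add(pair);
        }
        Comparator<Map.Entry<String, String>> byFileThenKey =
                Comparator.comparing((Map.Entry<String, String> e) -> e.getKey())
                          .thenComparing(e -> e.getValue());
        for (List<Map.Entry<String, String>> partition : partitions) {
            partition.sort(byFileThenKey);
        }
        return partitions;
    }

    // Counts how many times a file would have to be opened when walking a
    // partition in order: once per run of consecutive pairs with the same file.
    static int fileOpens(List<Map.Entry<String, String>> partition) {
        int opens = 0;
        String current = null;
        for (Map.Entry<String, String> pair : partition) {
            if (!pair.getKey().equals(current)) {
                opens++;
                current = pair.getKey();
            }
        }
        return opens;
    }
}
```

Because the partition index depends only on the file name, each file is opened once in total across all partitions, which is the property the global sort was providing at a higher shuffle cost.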
--
This message was sent by Atlassian Jira
(v8.20.10#820010)