sivabalan narayanan created HUDI-9137:
-----------------------------------------
Summary: Improve how we determine spark partitions in bloom index
based on record key size instead of key per bucket config
Key: HUDI-9137
URL: https://issues.apache.org/jira/browse/HUDI-9137
Project: Apache Hudi
Issue Type: Improvement
Components: index
Reporter: sivabalan narayanan
As of now, in the stage where we check the bloom filter and then look up the
file for an actual match, the number of Spark partitions is determined by the
{{hoodie.bloom.index.keys.per.bucket}} config. But when record keys are large
(e.g. 200 bytes), this can route too much data to each Spark partition.
Ideally, we would estimate the size of the record keys and determine the number
of Spark partitions from that. Say each record key is 50 bytes: we could route
~2.5M keys to one Spark task if we want to size each task at 120 MB. If the
record key size is 200 bytes instead, that drops to ~625k entries per Spark
task.
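The sizing math above can be sketched as follows. This is a minimal, hypothetical illustration, not Hudi's actual implementation; the class, method names, and the 120 MB target constant are all assumptions for the example.

```java
// Hypothetical sketch: derive keys-per-task (and partition count) from an
// estimated record key size, instead of a fixed keys-per-bucket config.
public class BloomIndexPartitionSizing {

    // Assumed target amount of key data routed to one Spark task (120 MB).
    static final long TARGET_BYTES_PER_TASK = 120L * 1024 * 1024;

    // How many keys fit in one task at the given estimated key size.
    static long keysPerTask(int estimatedKeySizeBytes) {
        return TARGET_BYTES_PER_TASK / estimatedKeySizeBytes;
    }

    // Number of Spark partitions needed for totalKeys, rounded up, at least 1.
    static int numPartitions(long totalKeys, int estimatedKeySizeBytes) {
        long perTask = keysPerTask(estimatedKeySizeBytes);
        return (int) Math.max(1, (totalKeys + perTask - 1) / perTask);
    }

    public static void main(String[] args) {
        System.out.println(keysPerTask(50));   // ~2.5M keys per task
        System.out.println(keysPerTask(200));  // ~625k keys per task
        System.out.println(numPartitions(10_000_000L, 200));
    }
}
```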
But getting hold of the record key size may not be easy without collecting a
sample of records in the driver.
Alternatively, we can assume each column configured in the record key field is
36 bytes (the length of a UUID string). Then, based on how many columns are
configured, we can derive a rough key size without collecting any keys in the
driver.
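The column-count heuristic could look like the sketch below. This is an assumption-laden illustration, not a Hudi API: the method name and the comma-separated record key field format are hypothetical.

```java
// Hypothetical sketch: estimate record key size from the number of columns
// in the configured record key field, assuming each column is UUID-sized
// (36 bytes), so no key sampling is needed in the driver.
public class RecordKeySizeEstimator {

    // Length of a canonical UUID string, e.g. "123e4567-e89b-12d3-a456-426614174000".
    static final int UUID_STRING_SIZE_BYTES = 36;

    // recordKeyFields is assumed to be a comma-separated column list,
    // e.g. "user_id,event_ts" -> 2 columns -> 72 bytes.
    static int estimateKeySizeBytes(String recordKeyFields) {
        int numColumns = recordKeyFields.split(",").length;
        return numColumns * UUID_STRING_SIZE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(estimateKeySizeBytes("uuid"));             // 36
        System.out.println(estimateKeySizeBytes("user_id,event_ts")); // 72
    }
}
```

This trades accuracy for simplicity: keys much larger or smaller than UUIDs will be mis-estimated, but it avoids a driver-side sampling pass entirely.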
--
This message was sent by Atlassian Jira
(v8.20.10#820010)