sivabalan narayanan created HUDI-9137:
-----------------------------------------
Summary: Improve how we determine spark partitions in bloom index
based on record key size instead of key per bucket config
Key: HUDI-9137
URL: https://issues.apache.org/jira/browse/HUDI-9137
Project: Apache Hudi
Issue Type: Improvement
Components: index
Reporter: sivabalan narayanan
As of now, in the stage where we check the bloom filter and then look up the
file for an actual match, the number of Spark partitions is determined by the
{{hoodie.bloom.index.keys.per.bucket}} config. But when record keys are large
(e.g. 200 bytes), this can route too much data to each Spark partition.
Ideally, we would estimate the size of the record keys and determine the number
of Spark partitions from that. Say each record key is 50 bytes: we could route
~2.5M keys to one Spark task if we want to size each task at 120 MB. If the
record key size is 200 bytes instead, that drops to ~625k entries per Spark
task.
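The sizing math above can be sketched as follows. This is a minimal, hypothetical illustration, not Hudi's actual implementation; the class, method names, and the 120 MB target constant are all assumptions for the example.

```java
// Hypothetical sketch: derive keys-per-task (and partition count) from an
// estimated record key size, instead of a fixed keys-per-bucket config.
public class BloomIndexPartitionSizing {

    // Assumed target amount of key data routed to one Spark task (120 MB).
    static final long TARGET_BYTES_PER_TASK = 120L * 1024 * 1024;

    // How many keys fit in one task at the given estimated key size.
    static long keysPerTask(int estimatedKeySizeBytes) {
        return TARGET_BYTES_PER_TASK / estimatedKeySizeBytes;
    }

    // Number of Spark partitions needed for totalKeys, rounded up, at least 1.
    static int numPartitions(long totalKeys, int estimatedKeySizeBytes) {
        long perTask = keysPerTask(estimatedKeySizeBytes);
        return (int) Math.max(1, (totalKeys + perTask - 1) / perTask);
    }

    public static void main(String[] args) {
        System.out.println(keysPerTask(50));   // ~2.5M keys per task
        System.out.println(keysPerTask(200));  // ~625k keys per task
        System.out.println(numPartitions(10_000_000L, 200));
    }
}
```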
But getting hold of the record key size may not be easy without collecting a
sample of records in the driver.
Alternatively, we can assume each column configured in the record key field is
36 bytes (the length of a UUID string). Then, based on how many columns are
configured, we can derive a rough key size without collecting any keys in the
driver.
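The column-count heuristic could look like the sketch below. This is an assumption-laden illustration, not a Hudi API: the method name and the comma-separated record key field format are hypothetical.

```java
// Hypothetical sketch: estimate record key size from the number of columns
// in the configured record key field, assuming each column is UUID-sized
// (36 bytes), so no key sampling is needed in the driver.
public class RecordKeySizeEstimator {

    // Length of a canonical UUID string, e.g. "123e4567-e89b-12d3-a456-426614174000".
    static final int UUID_STRING_SIZE_BYTES = 36;

    // recordKeyFields is assumed to be a comma-separated column list,
    // e.g. "user_id,event_ts" -> 2 columns -> 72 bytes.
    static int estimateKeySizeBytes(String recordKeyFields) {
        int numColumns = recordKeyFields.split(",").length;
        return numColumns * UUID_STRING_SIZE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(estimateKeySizeBytes("uuid"));             // 36
        System.out.println(estimateKeySizeBytes("user_id,event_ts")); // 72
    }
}
```

This trades accuracy for simplicity: keys much larger or smaller than UUIDs will be mis-estimated, but it avoids a driver-side sampling pass entirely.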
--
This message was sent by Atlassian Jira
(v8.20.10#820010)