[
https://issues.apache.org/jira/browse/HUDI-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivabalan narayanan reassigned HUDI-9137:
-----------------------------------------
Assignee: sivabalan narayanan
> Improve how we determine spark partitions in bloom index based on record key
> size instead of key per bucket config
> ------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-9137
> URL: https://issues.apache.org/jira/browse/HUDI-9137
> Project: Apache Hudi
> Issue Type: Improvement
> Components: index
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Fix For: 1.0.2
>
>
> As of now, in the stage where we check the bloom filter and then look up the
> file for actual matches, the number of spark partitions is determined by the
> {{hoodie.bloom.index.keys.per.bucket}} config. But in cases where the record
> key is large (e.g. 200 bytes), this can route too much data to each spark
> partition.
>
> Ideally, we want to estimate the size of the record keys and determine the
> number of spark partitions from that. Say each record key is 50 bytes and we
> target ~125 MB per spark task: we could route 2.5M keys to one spark task.
>
> If the record key size is 200 bytes instead, that drops to 625K keys per
> spark task.
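A minimal sketch of the sizing arithmetic above (the ~125 MB target and per-key byte sizes are illustrative values from this ticket, not actual Hudi defaults):

```java
public class KeysPerTask {
    // How many keys fit in one spark task for a given per-task byte budget
    // and an (estimated) size of a single record key.
    static long keysPerTask(long targetBytesPerTask, long bytesPerKey) {
        return targetBytesPerTask / bytesPerKey;
    }

    public static void main(String[] args) {
        System.out.println(keysPerTask(125_000_000L, 50L));  // 50-byte keys
        System.out.println(keysPerTask(125_000_000L, 200L)); // 200-byte keys
    }
}
```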
>
> But getting hold of the record key size may not be easy w/o collecting a
> sample of records in the driver.
> Alternatively, we can assume each column configured in the record key field is
> 36 bytes (the size of a UUID string). Depending on how many columns are
> configured, we can derive a rough size w/o collecting the keys in the driver.
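The 36-bytes-per-key-column heuristic could be sketched as below (class and method names are hypothetical, not Hudi's actual code):

```java
public class BloomIndexPartitionEstimator {
    // Assumed size of one record-key column: a UUID string is 36 characters.
    static final long ASSUMED_BYTES_PER_KEY_COLUMN = 36L;

    // Derive a spark partition count from an estimated total key size,
    // without collecting any keys in the driver.
    static long estimatePartitions(long totalKeys, int numKeyColumns,
                                   long targetBytesPerTask) {
        long estimatedKeyBytes = totalKeys * numKeyColumns * ASSUMED_BYTES_PER_KEY_COLUMN;
        // ceil-divide, with at least one partition
        return Math.max(1L, (estimatedKeyBytes + targetBytesPerTask - 1) / targetBytesPerTask);
    }

    public static void main(String[] args) {
        // 100M single-column UUID keys at ~125 MB per task
        System.out.println(estimatePartitions(100_000_000L, 1, 125_000_000L));
    }
}
```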
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)