[ 
https://issues.apache.org/jira/browse/HUDI-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-9137:
-----------------------------------------

    Assignee: sivabalan narayanan

> Improve how we determine spark partitions in bloom index based on record key 
> size instead of key per bucket config
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-9137
>                 URL: https://issues.apache.org/jira/browse/HUDI-9137
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: index
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 1.0.2
>
>
> As of now, for the stage where we check the bloom filter and then look up the 
> file for an actual match, the number of Spark partitions is determined by the 
> {{hoodie.bloom.index.keys.per.bucket}} config. But in cases where the record 
> key is large (e.g. 200 bytes), this can result in too much data being routed 
> to each Spark partition.
>  
> Ideally, we want to estimate the size of the record keys and then determine 
> the number of Spark partitions based on that. Say each record key is 50 bytes; 
> we could route ~2.5M keys to one Spark task if we want to size it at 120 MB 
> per task.
>  
> If the record key size is 200 bytes, that would drop to ~625k entries per 
> Spark task.
>  
> But getting hold of the record key size may not be easy without collecting a 
> sample of records in the driver.
> Alternatively, we can assume each column configured in the record key field is 
> 36 bytes (the size of a UUID string), and depending on how many columns are 
> configured, derive a rough key size without collecting the keys in the driver.
>  
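The sizing arithmetic described above could be sketched roughly as follows. This is a minimal illustration under the assumptions stated in the description (36 bytes per key column, ~120 MB of key data per task); the class, method names, and constants are hypothetical, not actual Hudi code:

```java
// Hypothetical sketch of the proposed sizing logic; names and constants are
// illustrative, not Hudi's actual implementation.
public class BloomIndexPartitionSizing {

    // Assumed per-column key size: 36 bytes, the length of a UUID string.
    static final long ASSUMED_BYTES_PER_KEY_COLUMN = 36L;

    // Target amount of key data routed to one Spark task (~120 MB).
    static final long TARGET_BYTES_PER_TASK = 120L * 1024 * 1024;

    /** Keys to route to one Spark task, given an estimated key size in bytes. */
    static long keysPerTask(long estimatedKeySizeBytes) {
        return TARGET_BYTES_PER_TASK / estimatedKeySizeBytes;
    }

    /** Rough key size derived from the number of configured record key columns,
     *  avoiding any collection of sampled keys in the driver. */
    static long estimatedKeySize(int numRecordKeyColumns) {
        return numRecordKeyColumns * ASSUMED_BYTES_PER_KEY_COLUMN;
    }

    /** Number of Spark partitions for totalKeys keys of the estimated size. */
    static int numPartitions(long totalKeys, long estimatedKeySizeBytes) {
        long perTask = keysPerTask(estimatedKeySizeBytes);
        // Ceiling division, with at least one partition.
        return (int) Math.max(1, (totalKeys + perTask - 1) / perTask);
    }

    public static void main(String[] args) {
        // 50-byte keys: ~2.5M keys per 120 MB task, as in the description.
        System.out.println(keysPerTask(50));   // 2516582
        // 200-byte keys: ~625k keys per task.
        System.out.println(keysPerTask(200));  // 629145
        // A single UUID key column, 10M keys total.
        System.out.println(numPartitions(10_000_000L, estimatedKeySize(1)));
    }
}
```

With this shape, the keys-per-bucket knob is replaced by a bytes-per-task target, so wide multi-column keys automatically get fewer keys per task.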



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
