sivabalan narayanan created HUDI-1083:
-----------------------------------------

             Summary: Minor optimization in Determining insert bucket location 
for a given key
                 Key: HUDI-1083
                 URL: https://issues.apache.org/jira/browse/HUDI-1083
             Project: Apache Hudi
          Issue Type: Improvement
          Components: Writer Core
            Reporter: sivabalan narayanan


Currently, the bucket for a given key is determined as follows.

In every partition, we find all insert buckets and assign each one a weight.

For example, weights of 0.2, 0.3 and 0.5 for a partition with 100 records to 
be inserted mean that 20 records go into B0, 30 into B1 and 50 into B2.

Within getPartition(Object key), we walk linearly through the bucket weights 
to find the right bucket for a given key. For instance, if mod(hash value) 
gives 90 out of 100, i.e. 0.9, we keep adding the bucket weights until the 
running total exceeds 0.9.
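
A minimal sketch of the linear walk described above, for illustration only. 
The class and method names (LinearBucketLookup, findBucket) are hypothetical 
and are not taken from the Hudi code base; the sketch also assumes the weights 
are non-empty and sum to 1.

```java
final class LinearBucketLookup {

    // Hypothetical helper: returns the index of the first bucket whose
    // running weight total exceeds r, where r is expected to be in [0, 1).
    static int findBucket(double[] weights, double r) {
        double total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            total += weights[i];
            if (r < total) {
                return i;
            }
        }
        // Assumption: fall back to the last bucket if rounding keeps the
        // total at or below r.
        return weights.length - 1;
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.3, 0.5};   // B0, B1, B2 from the example
        long hash = 90L;                      // stand-in for the key's hash value
        long totalInserts = 100L;
        double r = (double) Math.floorMod(hash, totalInserts) / totalInserts;
        System.out.println("bucket = B" + findBucket(weights, r)); // B2
    }
}
```

Every call walks up to N buckets, which is the O(N) cost the proposal aims 
to reduce.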

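Instead, we could calculate the cumulative weights upfront and do a binary 
search within getPartition().

For the example above, the cumulative weights would be 0.2, 0.5, 1.0.

With mod(hash value), we could then binary search the cumulative weights to 
find the right bucket, cutting the lookup cost from O(N) to O(log N), where N 
is the number of insert buckets in the partition.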
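
A rough sketch of the proposed approach, under the assumption that the 
weights are non-empty and sum to 1. The names used here 
(CumulativeBucketLookup, cumulativeWeights, findBucket) are hypothetical and 
do not refer to existing Hudi classes or methods.

```java
final class CumulativeBucketLookup {

    // Computed once per partition, upfront: {0.2, 0.3, 0.5} -> {0.2, 0.5, 1.0}.
    static double[] cumulativeWeights(double[] weights) {
        double[] cumulative = new double[weights.length];
        double total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            total += weights[i];
            cumulative[i] = total;
        }
        return cumulative;
    }

    // Binary search for the first index whose cumulative weight exceeds r.
    // Values of r at or above the last entry are clamped to the last bucket.
    static int findBucket(double[] cumulative, double r) {
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (r < cumulative[mid]) {
                hi = mid;       // answer is mid or to its left
            } else {
                lo = mid + 1;   // answer is strictly to the right
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        double[] cumulative = cumulativeWeights(new double[] {0.2, 0.3, 0.5});
        double r = 90 / 100.0;  // mod(hash value) from the example
        System.out.println("bucket = B" + findBucket(cumulative, r)); // B2
    }
}
```

The cumulative array is built once in O(N); each lookup afterwards takes 
O(log N) comparisons and returns the same bucket as the linear walk.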

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
