sivabalan narayanan created HUDI-1083:
-----------------------------------------
Summary: Minor optimization in Determining insert bucket location
for a given key
Key: HUDI-1083
URL: https://issues.apache.org/jira/browse/HUDI-1083
Project: Apache Hudi
Issue Type: Improvement
Components: Writer Core
Reporter: sivabalan narayanan
As of now, the insert bucket for a given key is determined as follows.
In every partition, we find all insert buckets and assign each one a weight.
For example, weights of 0.2, 0.3 and 0.5 for a partition with 100 records to be
inserted mean that 20 records go into B0, 30 into B1 and 50 into B2.
Within getPartition(Object key), we walk linearly through the bucket weights
to find the right bucket for the key. For instance, if mod(hash value) is
90/100 = 0.9, we keep adding bucket weights until the running sum exceeds 0.9.
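
A minimal sketch of the linear walk described above, in Java. The class and
method names (LinearBucketLookup, findBucket) and the parameter r, which stands
for mod(hash value) scaled into [0, 1), are illustrative assumptions and not
Hudi's actual code.

```java
final class LinearBucketLookup {
    // Hypothetical helper: weights are the per-bucket fractions for one
    // partition (assumed to sum to about 1.0), r is the scaled hash in [0, 1).
    static int findBucket(double[] weights, double r) {
        double runningSum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            runningSum += weights[i];
            // Stop at the first bucket whose running sum exceeds r.
            if (r < runningSum) {
                return i;
            }
        }
        // Guard against rounding: fall back to the last bucket.
        return weights.length - 1;
    }
}
```

Every call repeats the additions, so the cost per key grows linearly with the
number of insert buckets in the partition.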
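Instead, we could calculate the cumulative weights upfront and do a binary
search within getPartition().
For the example above, the cumulative weights are 0.2, 0.5, 1.0.
With mod(hash value), a binary search over these values finds the right bucket
and cuts the lookup cost from O(N) to O(log N), where N is the number of insert
buckets in the partition.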
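
A rough sketch of the proposed lookup, in Java, assuming strictly positive
bucket weights so that the cumulative array is strictly increasing. The names
CumulativeBucketLookup, cumulative and findBucket are hypothetical, and r stands
for mod(hash value) scaled into [0, 1). This is one possible shape for the
change, not the actual patch.

```java
import java.util.Arrays;

final class CumulativeBucketLookup {
    // Computed once per partition: {0.2, 0.3, 0.5} becomes {0.2, 0.5, 1.0}.
    static double[] cumulative(double[] weights) {
        double[] cum = new double[weights.length];
        double runningSum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            runningSum += weights[i];
            cum[i] = runningSum;
        }
        return cum;
    }

    // Called per key: index of the first cumulative weight that exceeds r.
    static int findBucket(double[] cum, double r) {
        int idx = Arrays.binarySearch(cum, r);
        // Exact hit at idx: the first value greater than r is the next one.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the insertion
        // point is the index of the first element greater than r.
        int bucket = idx >= 0 ? idx + 1 : -idx - 1;
        // Guard against rounding: clamp to the last bucket.
        return Math.min(bucket, cum.length - 1);
    }
}
```

The cumulative array would be built once when the bucket weights are assigned,
so only the binary search runs inside getPartition().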
--
This message was sent by Atlassian Jira
(v8.3.4#803005)