[ https://issues.apache.org/jira/browse/HUDI-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
leesf updated HUDI-1083:
------------------------
    Fix Version/s: 0.6.1

> Minor optimization in Determining insert bucket location for a given key
> ------------------------------------------------------------------------
>
>                 Key: HUDI-1083
>                 URL: https://issues.apache.org/jira/browse/HUDI-1083
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>            Reporter: sivabalan narayanan
>            Assignee: shenh062326
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.1
>
>
> As of now, the insert bucket for a given key is determined as follows.
> In every partition, we find all insert buckets and assign each a weight.
> For example, weights of 0.2, 0.3 and 0.5 for a partition with 100 records to
> be inserted mean that 20 records go into B0, 30 into B1 and 50 into B2.
> Within getPartition(Object key), we walk linearly through the bucket weights
> to find the right bucket for the key. For instance, if mod(hash value) is
> 90/100 = 0.9, we keep adding bucket weights until the running total exceeds
> 0.9.
> Instead, we could compute the cumulative weights upfront (0.2, 0.5, 1 in
> the example above) and do a binary search over them within getPartition().
> With mod(hash value) as the search key, the binary search finds the right
> bucket and cuts the lookup cost from O(N) to O(log N).
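
The proposal in the issue can be sketched roughly as follows. This is a minimal illustration, not Hudi's actual partitioner code: the class name BucketLookupSketch and all of its methods are hypothetical, the weights are assumed to be strictly positive (so the cumulative array is strictly increasing), and the way the hash is mapped into [0, 1) is an assumption based on the "90/100 = 0.9" example.

```java
import java.util.Arrays;

// Hypothetical sketch of the lookup described in the issue; names are illustrative only.
final class BucketLookupSketch {

    // Assumed mapping of a key's hash to a value in [0, 1), e.g. hash 90 of 100 inserts -> 0.9.
    static double toUnitInterval(long hash, long totalInserts) {
        return Math.floorMod(hash, totalInserts) / (double) totalInserts;
    }

    // Turns per-bucket weights (0.2, 0.3, 0.5) into running totals (0.2, 0.5, 1.0).
    // Computed once upfront rather than on every lookup.
    static double[] cumulativeWeights(double[] weights) {
        double[] cumulative = new double[weights.length];
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i];
            cumulative[i] = sum;
        }
        return cumulative;
    }

    // Current approach: add weights until the running total exceeds r. O(N) per key.
    static int findBucketLinear(double[] weights, double r) {
        double total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            total += weights[i];
            if (r < total) {
                return i;
            }
        }
        return weights.length - 1; // guard against floating-point rounding at the top end
    }

    // Proposed approach: binary search over the precomputed totals. O(log N) per key.
    static int findBucketBinary(double[] cumulative, double r) {
        int idx = Arrays.binarySearch(cumulative, r);
        // Exact hit at idx: the first total that exceeds r is at idx + 1.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the insertion
        // point is the first index whose total exceeds r.
        int bucket = idx >= 0 ? idx + 1 : -idx - 1;
        return Math.min(bucket, cumulative.length - 1);
    }
}
```

With the weights from the issue, both lookups place a key whose value is 0.9 in the third bucket (index 2, i.e. B2). The binary search relies on the cumulative array being sorted, which holds as long as no weight is negative; buckets with a weight of zero would produce duplicate totals and would need separate handling.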