[ https://issues.apache.org/jira/browse/HUDI-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
leesf closed HUDI-1082. ----------------------- > Bug in deciding the upsert/insert buckets > ----------------------------------------- > > Key: HUDI-1082 > URL: https://issues.apache.org/jira/browse/HUDI-1082 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core > Affects Versions: 0.6.0 > Reporter: sivabalan narayanan > Assignee: shenh062326 > Priority: Major > Labels: pull-request-available > Fix For: 0.6.0 > > > In > [UpsertPartitioner|[https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java]], > when getPartition(Object key) is called, the logic to determine where the > record to be inserted is relying on globalInsertCounts where as this should > be perPartitionInsertCount. > > Bcoz, the weights for all targetInsert buckets are determined based on total > Inserts going into the partition of interest. // check like 200. Whereas when > getPartition(key) is called, we use global insert count to determine the > right bucket. > > For instance, > P1: 3 insert buckets with weights 0.2, 0.5 and 0.3 with total records to be > inserted is 100. > P2: 4 bucket with weights 0.1, 0.8, 0.05, 0.05 with total records to be > inserted is 10025. > So, ideal allocation is > P1: B0 -> 20, B1 -> 50, B2 -> 30 > P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503 > > getPartition() for a key is determined based on following. > mod (hash value, inserts)/ inserts. > Instead of considering inserts for the partition of interest, currently we > take global insert counts. > Lets say, these are the hash values for insert records in P1. > 5, 14, 20, 25, 90, 500, 1001, 5180. > record hash | expected bucket in P1 | actual bucket in P1 | > 5 | B0 | B0 > 14 | B0 | B0 > 21 | B1 | B0 > 30 | B1 | B0 > 90 | B2 | B0 > 500 | B0 | B0 > 1490 | B2 | B1 > 10019 | B0 | B3 > > > > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)