sivabalan narayanan created HUDI-1082:
-----------------------------------------

             Summary: Bug in deciding the upsert/insert buckets
                 Key: HUDI-1082
                 URL: https://issues.apache.org/jira/browse/HUDI-1082
             Project: Apache Hudi
          Issue Type: Bug
          Components: Writer Core
    Affects Versions: 0.6.0
            Reporter: sivabalan narayanan


In 
[UpsertPartitioner|[https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java]],
 when getPartition(Object key) is called, the logic to determine where the 
record to be inserted is relying on globalInsertCounts where as this should be 
perPartitionInsertCount.

 

Bcoz, the weights for all targetInsert buckets are determined based on total 
Inserts going into the partition of interest. // check like 200. Whereas when 
getPartition(key) is called, we use global insert count to determine the right 
bucket. 

 

For instance,

P1: 3 insert buckets with weights 0.2, 0.5 and 0.3 with total records to be 
inserted is 100.

P2: 4 bucket with weights 0.1, 0.8, 0.05, 0.05 with total records to be 
inserted is 10025. 

So, ideal allocation is

P1: B0 -> 20, B1 -> 50, B2 -> 30

P2: B0 -> 1002, B1 -> 8020, B2 -> 502, B3 -> 503

 

getPartition() for a key is determined based on following.

mod (hash value/inserts)/ inserts.

Instead of considering inserts for the partition of interest, currently we take 
global insert counts.

Lets say, these are the hash values for insert records in P1.

5, 14, 20, 25, 90, 500, 1001, 5180.

record hash | expected bucket in P1 | actual bucket in P1 |

5     | B0 | B0

14   | B0 | B0

21   | B1  | B0

30  | B1 | B0

90 | B2 | B0

500 | B0 | B0

1490 | B2 | B1

10019 | B0 | B3

 

 

 

 

 

 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to