wombatu-kun commented on code in PR #12120:
URL: https://github.com/apache/hudi/pull/12120#discussion_r1811932730


##########
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/index/bucket/TestBucketIdentifier.java:
##########
@@ -121,9 +119,8 @@ public void testGetHashKeys() {
     assertEquals("bcd", keys.get(1));
 
     keys = identifier.getHashKeys(new HoodieKey("f1:abc,f2:bcd,efg", 
"partition"), "f1,f2");
-    assertEquals(3, keys.size());
+    assertEquals(2, keys.size());
     assertEquals("abc", keys.get(0));
-    assertEquals("bcd", keys.get(1));
-    assertEquals("efg", keys.get(2));
+    assertEquals("bcd,efg", keys.get(1));

Review Comment:
   This new behavior is correct: the values extracted for hashing fields `f1,f2` 
should be `["abc","bcd,efg"]`. The previous values, `["abc","bcd","efg"]`, were 
incorrect, but users may already have tables with data written to the "wrong" 
buckets. When they upgrade Hudi and continue writing to those tables, some 
record keys may land in different buckets.  
   @danny0405 Danny, hi! Is this critical? Is it a common case? Could you please 
join this review and decide whether `extractRecordKeysByFields` should be fixed 
or not?
   The memory allocation optimization is great, but I don't know how to deal 
with this potentially "breaking" change of bucketing.
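   To illustrate the semantics under discussion, here is a minimal, hypothetical sketch (not the actual Hudi `extractRecordKeysByFields` implementation) of field-aware extraction from a composite record key of the form `field:value,field:value`. The idea is that a comma only starts a new pair when the next segment is prefixed by one of the requested field names; otherwise it belongs to the previous value, so `f1:abc,f2:bcd,efg` with fields `f1,f2` yields `["abc","bcd,efg"]` rather than the naive comma split `["abc","bcd","efg"]`:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HashKeyExtractionSketch {

  // Hypothetical re-implementation for illustration only; the real Hudi
  // code lives in BucketIdentifier / KeyGenUtils and may differ in detail.
  static List<String> extractRecordKeysByFields(String recordKey, List<String> fields) {
    List<String> values = new ArrayList<>();
    String[] segments = recordKey.split(",");
    String currentField = null;
    StringBuilder currentValue = null;
    for (String segment : segments) {
      int colon = segment.indexOf(':');
      String maybeField = colon >= 0 ? segment.substring(0, colon) : null;
      if (maybeField != null && fields.contains(maybeField)) {
        // A new "<field>:" prefix for a requested field starts a new value.
        if (currentField != null) {
          values.add(currentValue.toString());
        }
        currentField = maybeField;
        currentValue = new StringBuilder(segment.substring(colon + 1));
      } else if (currentValue != null) {
        // No requested-field prefix: the comma was part of the value itself.
        currentValue.append(",").append(segment);
      }
    }
    if (currentField != null) {
      values.add(currentValue.toString());
    }
    return values;
  }

  public static void main(String[] args) {
    // Matches the updated assertions in the test diff above.
    List<String> keys = extractRecordKeysByFields("f1:abc,f2:bcd,efg",
        Arrays.asList("f1", "f2"));
    System.out.println(keys); // two values: "abc" and "bcd,efg"
  }
}
```

   The compatibility concern is exactly that a key like `bcd,efg` hashed as one value lands in a different bucket than `bcd` and `efg` hashed separately, so tables written before the fix would have records in buckets the new code no longer targets.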



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to