wombatu-kun commented on code in PR #12120:
URL: https://github.com/apache/hudi/pull/12120#discussion_r1811932730
##########
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/index/bucket/TestBucketIdentifier.java:
##########
@@ -121,9 +119,8 @@ public void testGetHashKeys() {
assertEquals("bcd", keys.get(1));
keys = identifier.getHashKeys(new HoodieKey("f1:abc,f2:bcd,efg",
"partition"), "f1,f2");
- assertEquals(3, keys.size());
+ assertEquals(2, keys.size());
assertEquals("abc", keys.get(0));
- assertEquals("bcd", keys.get(1));
- assertEquals("efg", keys.get(2));
+ assertEquals("bcd,efg", keys.get(1));
Review Comment:
This new behavior is correct: the values for the hashing fields `f1,f2` should be
`["abc","bcd,efg"]`. The previous values, `["abc","bcd","efg"]`, were incorrect,
but users may have tables with data already written to the "wrong" buckets. When
they upgrade Hudi and continue writing to those tables, some record keys may be
written to a different bucket.
Hi @danny0405! Is this critical? Is it a common case? Could you please join
this review and decide whether `extractRecordKeysByFields` should be fixed
or not?
The memory allocation optimization is great, but I don't know how to deal with
this potentially "breaking" change in bucketing.
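To make the behavior change concrete, here is a minimal, hypothetical sketch (not Hudi's actual `extractRecordKeysByFields` implementation) of extracting hashing-field values from a composite record key such as `f1:abc,f2:bcd,efg`. The key idea is that a comma-separated segment without a recognized `field:` prefix belongs to the previous field's value, so `f2` maps to `bcd,efg` rather than being split into two values:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecordKeyFieldExtractor {

  // Sketch: collect the values of the requested fields from a record key of
  // the form "field1:value1,field2:value2,...". A segment with no known
  // "field:" prefix is appended to the value currently being collected,
  // because the comma was part of that value, not a field separator.
  public static List<String> extractByFields(String recordKey, List<String> fields) {
    List<String> values = new ArrayList<>();
    StringBuilder current = null;
    for (String segment : recordKey.split(",")) {
      int idx = segment.indexOf(':');
      if (idx > 0 && fields.contains(segment.substring(0, idx))) {
        // Start of a new field:value pair we are hashing on.
        if (current != null) {
          values.add(current.toString());
        }
        current = new StringBuilder(segment.substring(idx + 1));
      } else if (current != null) {
        // No recognized field prefix: the comma belonged to the value itself.
        current.append(',').append(segment);
      }
    }
    if (current != null) {
      values.add(current.toString());
    }
    return values;
  }

  public static void main(String[] args) {
    List<String> keys =
        extractByFields("f1:abc,f2:bcd,efg", Arrays.asList("f1", "f2"));
    System.out.println(keys); // prints [abc, bcd,efg]
  }
}
```

Under this reading, the old behavior (splitting on every comma) would have produced three values and hashed `efg` as if it were a separate field value, which is why records with comma-containing values can land in different buckets after the fix.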
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]