geserdugarov opened a new issue, #12155:
URL: https://github.com/apache/hudi/issues/12155

   The current extraction of column values from the record key for bucket hashing works incorrectly:
   
https://github.com/apache/hudi/blob/a7512a206c5a1e8ce251cac7a302632a57d8c848/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java#L135-L139
   This logic was introduced in #7342.
   
   **Describe the problem you faced**
   
   We shouldn't simply split the record key by `,` and then split each resulting part by `:`; that algorithm breaks whenever either character appears inside a key value.
   
   For example, if the record key consists of a timestamp only:
   ```SQL
   DROP TABLE IF EXISTS hudi_ts_only_one_bucket;
   
   CREATE TABLE hudi_ts_only_one_bucket (
       ts STRING,
       desc STRING,
       PRIMARY KEY (ts) NOT ENFORCED
   ) 
   WITH (
       'connector' = 'hudi',
       'path' = 'hdfs://some_path/hudi_ts_only_one_bucket',
       'table.type' = 'COPY_ON_WRITE',
       'write.operation' = 'upsert',
       'index.type'='BUCKET',
       'hoodie.bucket.index.hash.field'='ts',
       'hoodie.bucket.index.num.buckets'='3',
       'write.tasks'='3',
       'read.tasks'='3'
   );
   
   INSERT INTO hudi_ts_only_one_bucket VALUES 
       ('2024-10-24 10:11:11.234','aa'),
       ('2024-10-24 10:22:12.345','aa'),
       ('2024-10-24 10:33:13.456','aa'),
       ('2024-10-24 10:44:14.567','aa'),
       ('2024-10-24 10:55:15.678','aa');
   ```
   
   then all records are always placed in a single bucket, regardless of the configured settings:
   ```Bash
   hdfs dfs -ls -h hdfs://some_path/hudi_ts_only_one_bucket
   Found 3 items
   .../hudi_ts_only_one_bucket/.hoodie
   .../hudi_ts_only_one_bucket/.hoodie_partition_metadata
   .../hudi_ts_only_one_bucket/00000001-4a48-44f3-a3fa-63c7a537e4a1_1-3-0_20241024183249279.parquet
   ```
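   The single-bucket result follows directly: the flawed split drops the whole timestamp value, so every record contributes the same empty string to the hash. A small sketch (the modulo hashing here is illustrative only, not Hudi's exact hash function):

   ```java
   import java.util.List;

   public class SameBucketDemo {
     // Illustrative bucket assignment (an assumption, not Hudi's exact hash):
     // hash the extracted key string modulo the configured bucket count.
     static int bucketId(String extractedKey, int numBuckets) {
       return (extractedKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
     }

     public static void main(String[] args) {
       List<String> keys = List.of(
           "2024-10-24 10:11:11.234",
           "2024-10-24 10:22:12.345",
           "2024-10-24 10:33:13.456");
       for (String key : keys) {
         // the flawed extraction reduces every timestamp-only key to "",
         // so the bucket id is identical for all records
         System.out.println(key + " -> bucket " + bucketId("", 3));
       }
       // prints "bucket 0" for every key
     }
   }
   ```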
   
   The exact opposite can also happen. Suppose you want to route some data to a single bucket by one field (`f2`):
   ```SQL
   DROP TABLE IF EXISTS hudi_tricky_one_bucket;
   
   CREATE TABLE hudi_tricky_one_bucket (
       f1 STRING,
       f2 STRING,
       desc STRING,
       PRIMARY KEY (f1,f2) NOT ENFORCED
   ) 
   WITH (
       'connector' = 'hudi',
       'path' = 'hdfs://some_path/hudi_tricky_one_bucket',
       'table.type' = 'COPY_ON_WRITE',
       'write.operation' = 'upsert',
       'index.type'='BUCKET',
       'hoodie.bucket.index.hash.field'='',
       'hoodie.bucket.index.num.buckets'='3',
       'write.tasks'='3',
       'read.tasks'='3'
   );
   
   INSERT INTO hudi_tricky_one_bucket VALUES 
       ('101,010','forceToOneBucket','aa'),
       ('122,120','forceToOneBucket','aa'),
       ('123,410','forceToOneBucket','aa');
   ```
   
   but end up with multiple buckets as a result:
   ```Bash
   hdfs dfs -ls -h hdfs://some_path/hudi_should_be_one_but_got_a_lot
   Found 5 items
   .../hudi_should_be_one_but_got_a_lot/.hoodie
   .../hudi_should_be_one_but_got_a_lot/.hoodie_partition_metadata
   .../hudi_should_be_one_but_got_a_lot/00000000-b2fc-4e53-babd-10ee343463be_0-3-0_20241024184049044.parquet
   .../hudi_should_be_one_but_got_a_lot/00000001-937b-409b-a2d1-e09b73c332ea_1-3-0_20241024184049044.parquet
   .../hudi_should_be_one_but_got_a_lot/00000002-925f-4724-aaba-d32d9f8b526e_2-3-0_20241024184049044.parquet
   ```
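   This result has the same root cause: with `PRIMARY KEY (f1, f2)` the record key takes the form `f1:<value>,f2:<value>` (my assumption about the key layout written by the Flink writer), and the `,` inside `f1`'s values is misread as a field separator. The resulting fragments differ per row, so the hash input differs even though `f2` is identical everywhere:

   ```java
   import java.util.Arrays;

   public class FragmentDemo {
     public static void main(String[] args) {
       // Assumed composite record keys for the rows inserted above
       String[] recordKeys = {
           "f1:101,010,f2:forceToOneBucket",
           "f1:122,120,f2:forceToOneBucket",
           "f1:123,410,f2:forceToOneBucket"};
       for (String key : recordKeys) {
         // blind split by ',': the middle fragment ("010", "120", "410")
         // varies per row, so each row can hash to a different bucket
         System.out.println(Arrays.toString(key.split(",")));
       }
       // first line prints: [f1:101, 010, f2:forceToOneBucket]
     }
   }
   ```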
   
   **Expected behavior**
   
   For the timestamp case, users expect to see as many buckets as configured by `hoodie.bucket.index.num.buckets`. For the second case, if `f2` holds identical data and is used for bucketing, those records should all land in the same bucket.
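   One possible direction for a fix (my own sketch; the actual approach taken in PR #12120 may differ): when the table has a single record key field, the record key is the value itself, so it can be used whole instead of being split:

   ```java
   import java.util.List;

   public class FixedExtractionSketch {
     // Hypothetical fix sketch: a single-field record key carries no
     // "name:value" structure, so take it verbatim; ':' and ',' inside
     // the value then survive intact.
     static List<String> extract(String recordKey, List<String> keyFields) {
       if (keyFields.size() == 1) {
         return List.of(recordKey);
       }
       // composite keys would need parsing anchored on the known field names
       // (e.g. locating "f1:" and ",f2:") rather than blind splitting
       throw new UnsupportedOperationException("composite keys not sketched here");
     }

     public static void main(String[] args) {
       System.out.println(extract("2024-10-24 10:11:11.234", List.of("ts")));
       // prints: [2024-10-24 10:11:11.234]
     }
   }
   ```

   With the value preserved, distinct timestamps hash to distinct values and the records spread across the configured buckets.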
   
   I have a fix, presented in https://github.com/apache/hudi/pull/12120, but the remaining problem is migrating from the current, incorrect hash to the new one.
   
   **Environment Description**
   
   * Hudi version : current master (commit 9b3f85e1b4d037a6e10c81e4b8d5f3e8a4a01ef6)
   
   * Flink version : 1.17
   
   

