geserdugarov commented on code in PR #12545:
URL: https://github.com/apache/hudi/pull/12545#discussion_r1908297277


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java:
##########
@@ -207,11 +207,30 @@ public static DataStream<Object> append(
       Configuration conf,
       RowType rowType,
       DataStream<RowData> dataStream) {
-    WriteOperatorFactory<RowData> operatorFactory = AppendWriteOperator.getFactory(conf, rowType);
+    boolean isBucketIndex = OptionsResolver.isBucketIndexType(conf);
+    if (isBucketIndex) {

Review Comment:
   Currently, **insert operations** into COW and MOR tables lead to the same confusing file structure, with a separate parquet file for each row.
   
   ### Behavior without changes
   
   For instance, I want to do two insert operations:
   
   ```SQL
   CREATE TABLE hudi_debug_mor_no_bucket (
       id INT,
       part INT,
       desc STRING,
       PRIMARY KEY (id) NOT ENFORCED
   ) 
   WITH (
       'connector' = 'hudi',
       'path' = 'hdfs://<some path>/hudi_debug_mor_no_bucket',
       'table.type' = 'MERGE_ON_READ',
       'write.operation' = 'insert'
   );
   
   INSERT INTO hudi_debug_mor_no_bucket VALUES 
       (1,100,'aaa'),
       (2,200,'bbb'),
       (3,300,'ccc'),
       (4,400,'ddd');
   
   INSERT INTO hudi_debug_mor_no_bucket VALUES 
       (5,500,'eee'),
       (6,600,'fff');
   ```
   
   As a result, I got a **new parquet file for each row**:
   ```Bash
   hdfs dfs -ls hdfs://<some path>/hudi_debug_mor_no_bucket
   Found 8 items
   drwxr-xr-x   - gdugarov supergroup          0 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/.hoodie
   -rw-r--r--   3 gdugarov supergroup         96 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/.hoodie_partition_metadata
   -rw-r--r--   3 gdugarov supergroup     433873 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/1644d782-e5ea-435a-80a2-365dbcc77eb2-0_6-8-0_20250109153855066.parquet
   -rw-r--r--   3 gdugarov supergroup     433874 2025-01-09 15:39 hdfs://<some path>/hudi_debug_mor_no_bucket/30184188-88b3-4203-a842-17d119c7e567-0_1-8-0_20250109153920668.parquet
   -rw-r--r--   3 gdugarov supergroup     433870 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/7277ffc5-f0ee-4c5f-a8f6-50eb5cc72e35-0_4-8-0_20250109153855066.parquet
   -rw-r--r--   3 gdugarov supergroup     433873 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/c9e1b68a-fa14-44b1-8299-919a4083c8d0-0_5-8-0_20250109153855066.parquet
   -rw-r--r--   3 gdugarov supergroup     433871 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/ca68de14-9124-4282-942d-2f871a67f1d9-0_3-8-0_20250109153855066.parquet
   -rw-r--r--   3 gdugarov supergroup     433875 2025-01-09 15:39 hdfs://<some path>/hudi_debug_mor_no_bucket/fb3ce28b-f4f4-4452-80a4-80ae430db3c9-0_2-8-0_20250109153920668.parquet
   ```
   
   ### With proposed changes
   
   We run the same SQL, but with these parameters added:
   ```SQL
       'index.type'='BUCKET',
       'hoodie.bucket.index.hash.field'='id',
       'hoodie.bucket.index.num.buckets'='2'
   ```
   
   As a result, we get the expected 4 data files (2 buckets × 2 commits):
   ```Bash
   hdfs dfs -ls hdfs://<some path>/hudi_debug_mor_bucket_supported
   Found 6 items
   drwxr-xr-x   - gdugarov supergroup          0 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/.hoodie
   -rw-r--r--   3 gdugarov supergroup         96 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/.hoodie_partition_metadata
   -rw-r--r--   3 gdugarov supergroup     433970 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/00000000-4604-43da-ad35-85d748591783_0-8-0_20250109154427505.parquet
   -rw-r--r--   3 gdugarov supergroup     433858 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/00000000-b79c-412c-b05f-1e83a2455ed7_0-8-0_20250109154431469.parquet
   -rw-r--r--   3 gdugarov supergroup     433972 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/00000001-480e-4db1-9b4a-188320116924_1-8-0_20250109154427505.parquet
   -rw-r--r--   3 gdugarov supergroup     433857 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/00000001-4ade-444a-b972-971910278ab1_1-8-0_20250109154431469.parquet
   ```
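   
   The 4 data files line up with the configuration: each file name is prefixed with its bucket id (`00000000` or `00000001`), and each of the two INSERT statements writes at most one file per bucket. As a rough illustration of the hash-mod routing idea, here is a hypothetical sketch (this is NOT Hudi's actual `BucketIdentifier` hashing, just the general concept):
   
   ```python
   import hashlib
   
   # Hypothetical sketch: a bucket index routes every record to one of a
   # fixed set of buckets by hashing its index key, so each commit writes
   # to at most num_buckets file groups instead of a new parquet per row.
   # The hash below is illustrative only, not Hudi's implementation.
   
   NUM_BUCKETS = 2  # mirrors 'hoodie.bucket.index.num.buckets'='2'
   
   def bucket_id(hash_field_value, num_buckets=NUM_BUCKETS):
       """Route a record by hashing its index key (here, the 'id' column)."""
       digest = hashlib.md5(str(hash_field_value).encode()).hexdigest()
       return int(digest, 16) % num_buckets
   
   # The six 'id' values inserted by the two statements above.
   buckets = {row_id: bucket_id(row_id) for row_id in [1, 2, 3, 4, 5, 6]}
   # All records land in bucket 0 or 1, matching the 00000000/00000001
   # prefixes of the file names in the listing above.
   assert set(buckets.values()) <= {0, 1}
   ```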



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
