geserdugarov commented on code in PR #12545:
URL: https://github.com/apache/hudi/pull/12545#discussion_r1908297277
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java:
##########
@@ -207,11 +207,30 @@ public static DataStream<Object> append(
Configuration conf,
RowType rowType,
DataStream<RowData> dataStream) {
- WriteOperatorFactory<RowData> operatorFactory =
AppendWriteOperator.getFactory(conf, rowType);
+ boolean isBucketIndex = OptionsResolver.isBucketIndexType(conf);
+ if (isBucketIndex) {
Review Comment:
Currently, **insert operations** into COW and MOR tables produce the same file
structure and are **completely broken**.
### Behavior without changes
For instance, I want to do two insert operations:
```SQL
CREATE TABLE hudi_debug_mor_no_bucket (
id INT,
part INT,
desc STRING,
PRIMARY KEY (id) NOT ENFORCED
)
WITH (
'connector' = 'hudi',
'path' = 'hdfs://<some path>/hudi_debug_mor_no_bucket',
'table.type' = 'MERGE_ON_READ',
'write.operation' = 'insert'
);
INSERT INTO hudi_debug_mor_no_bucket VALUES
(1,100,'aaa'),
(2,200,'bbb'),
(3,300,'ccc'),
(4,400,'ddd');
INSERT INTO hudi_debug_mor_no_bucket VALUES
(5,500,'eee'),
(6,600,'fff');
```
As a result, I got a new parquet file for each row:
```Bash
hdfs dfs -ls hdfs://<some path>/hudi_debug_mor_no_bucket
Found 8 items
drwxr-xr-x - gdugarov supergroup 0 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/.hoodie
-rw-r--r-- 3 gdugarov supergroup 96 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/.hoodie_partition_metadata
-rw-r--r-- 3 gdugarov supergroup 433873 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/1644d782-e5ea-435a-80a2-365dbcc77eb2-0_6-8-0_20250109153855066.parquet
-rw-r--r-- 3 gdugarov supergroup 433874 2025-01-09 15:39 hdfs://<some path>/hudi_debug_mor_no_bucket/30184188-88b3-4203-a842-17d119c7e567-0_1-8-0_20250109153920668.parquet
-rw-r--r-- 3 gdugarov supergroup 433870 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/7277ffc5-f0ee-4c5f-a8f6-50eb5cc72e35-0_4-8-0_20250109153855066.parquet
-rw-r--r-- 3 gdugarov supergroup 433873 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/c9e1b68a-fa14-44b1-8299-919a4083c8d0-0_5-8-0_20250109153855066.parquet
-rw-r--r-- 3 gdugarov supergroup 433871 2025-01-09 15:38 hdfs://<some path>/hudi_debug_mor_no_bucket/ca68de14-9124-4282-942d-2f871a67f1d9-0_3-8-0_20250109153855066.parquet
-rw-r--r-- 3 gdugarov supergroup 433875 2025-01-09 15:39 hdfs://<some path>/hudi_debug_mor_no_bucket/fb3ce28b-f4f4-4452-80a4-80ae430db3c9-0_2-8-0_20250109153920668.parquet
```
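To make the "one file per row" observation easier to check, here is a minimal Java sketch that splits the base file names from the listing above into their parts and counts distinct file IDs per instant. It assumes the `<fileId>_<writeToken>_<instantTime>.parquet` naming layout visible in the listing; the class `BaseFileNames` and its methods are hypothetical helpers for illustration, not part of Hudi.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Hudi class.
final class BaseFileNames {
  // File names copied from the listing above.
  static final List<String> LISTING = List.of(
      "1644d782-e5ea-435a-80a2-365dbcc77eb2-0_6-8-0_20250109153855066.parquet",
      "30184188-88b3-4203-a842-17d119c7e567-0_1-8-0_20250109153920668.parquet",
      "7277ffc5-f0ee-4c5f-a8f6-50eb5cc72e35-0_4-8-0_20250109153855066.parquet",
      "c9e1b68a-fa14-44b1-8299-919a4083c8d0-0_5-8-0_20250109153855066.parquet",
      "ca68de14-9124-4282-942d-2f871a67f1d9-0_3-8-0_20250109153855066.parquet",
      "fb3ce28b-f4f4-4452-80a4-80ae430db3c9-0_2-8-0_20250109153920668.parquet");

  // Splits "<fileId>_<writeToken>_<instantTime>.parquet"
  // into {fileId, writeToken, instantTime}.
  static String[] parse(String fileName) {
    int dot = fileName.lastIndexOf('.');
    String stem = dot < 0 ? fileName : fileName.substring(0, dot);
    String[] parts = stem.split("_");
    if (parts.length != 3) {
      throw new IllegalArgumentException("Unexpected base file name: " + fileName);
    }
    return parts;
  }

  // Groups distinct file IDs by the instant that wrote them, keeping input order.
  static Map<String, Set<String>> fileIdsByInstant(List<String> fileNames) {
    Map<String, Set<String>> result = new LinkedHashMap<>();
    for (String name : fileNames) {
      String[] parts = parse(name);
      result.computeIfAbsent(parts[2], k -> new LinkedHashSet<>()).add(parts[0]);
    }
    return result;
  }

  public static void main(String[] args) {
    fileIdsByInstant(LISTING).forEach((instant, ids) ->
        System.out.println(instant + " -> " + ids.size() + " file IDs"));
  }
}
```

For this listing it reports four distinct file IDs for instant `20250109153855066` and two for `20250109153920668`, which lines up with the four rows of the first insert and the two rows of the second.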
### With proposed changes
We run the same SQL, but with the following options added to the `WITH` clause:
```SQL
'index.type'='BUCKET',
'hoodie.bucket.index.hash.field'='id',
'hoodie.bucket.index.num.buckets'='2'
```
As a result, we get the expected 4 files:
```Bash
hdfs dfs -ls hdfs://<some path>/hudi_debug_mor_bucket_supported
Found 6 items
drwxr-xr-x - gdugarov supergroup 0 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/.hoodie
-rw-r--r-- 3 gdugarov supergroup 96 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/.hoodie_partition_metadata
-rw-r--r-- 3 gdugarov supergroup 433970 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/00000000-4604-43da-ad35-85d748591783_0-8-0_20250109154427505.parquet
-rw-r--r-- 3 gdugarov supergroup 433858 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/00000000-b79c-412c-b05f-1e83a2455ed7_0-8-0_20250109154431469.parquet
-rw-r--r-- 3 gdugarov supergroup 433972 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/00000001-480e-4db1-9b4a-188320116924_1-8-0_20250109154427505.parquet
-rw-r--r-- 3 gdugarov supergroup 433857 2025-01-09 15:44 hdfs://<some path>/hudi_debug_mor_bucket_supported/00000001-4ade-444a-b972-971910278ab1_1-8-0_20250109154431469.parquet
```
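The count of four can be reasoned about with a small Java sketch: two buckets, each touched by both inserts, with the file ID starting with a zero-padded 8-digit bucket number (`00000000`, `00000001`) as seen in the listing. The class `BucketSketch` is hypothetical, and the hash below is a stand-in chosen for illustration. It is an assumption, not necessarily the function Hudi applies to `hoodie.bucket.index.hash.field`, so the actual row-to-bucket assignment may differ.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not a Hudi class.
final class BucketSketch {
  // Stand-in hash (assumption): non-negative hash of the key modulo the bucket count.
  static int bucketId(int key, int numBuckets) {
    return (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
  }

  // Zero-padded 8-digit prefix, matching the "00000000-..." file IDs in the listing.
  static String bucketPrefix(int bucketId) {
    return String.format("%08d", bucketId);
  }

  // Distinct bucket prefixes touched by one batch of keys.
  static Set<String> touchedBuckets(List<Integer> keys, int numBuckets) {
    Set<String> prefixes = new TreeSet<>();
    for (int key : keys) {
      prefixes.add(bucketPrefix(bucketId(key, numBuckets)));
    }
    return prefixes;
  }

  public static void main(String[] args) {
    int numBuckets = 2; // 'hoodie.bucket.index.num.buckets'='2'
    System.out.println(touchedBuckets(List.of(1, 2, 3, 4), numBuckets)); // first insert
    System.out.println(touchedBuckets(List.of(5, 6), numBuckets));       // second insert
  }
}
```

Under this stand-in hash, both inserts touch both buckets. If one file is written per touched bucket per insert, that gives 2 + 2 = 4 parquet files, which is consistent with the listing above.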