geserdugarov commented on code in PR #12545:
URL: https://github.com/apache/hudi/pull/12545#discussion_r1908303305
##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java:
##########
@@ -207,11 +207,30 @@ public static DataStream<Object> append(
Configuration conf,
RowType rowType,
DataStream<RowData> dataStream) {
- WriteOperatorFactory<RowData> operatorFactory =
AppendWriteOperator.getFactory(conf, rowType);
+ boolean isBucketIndex = OptionsResolver.isBucketIndexType(conf);
+ if (isBucketIndex) {
Review Comment:
There is a major problem with proposed changes.
### With proposed in this MR changes
**Switching from `insert` some data to `upsert` of another bunch of data
wouldn't work** due to:
https://github.com/apache/hudi/blob/e67d0aa71e2253a5b5cf95028cdf95482ffeca6a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bucket/BucketStreamWriteFunction.java#L181-L184
We will get exception if we will try it.
### Behavior without changes
Results are also questionable due to data duplication.
Paste using `insert`:
```SQL
INSERT INTO hudi_debug_mor_no_bucket VALUES
(1,100,'aaa'),
(2,200,'bbb'),
(3,300,'ccc'),
(4,400,'ddd');
INSERT INTO hudi_debug_mor_no_bucket VALUES
(5,500,'eee'),
(6,600,'fff');
```
Switch to `upsert` and run:
```SQL
INSERT INTO hudi_debug_mor_no_bucket VALUES
(5,500,'new value'),
(6,600,'new value');
```
In a results, files in HDFS:
```Text
hdfs dfs -ls hdfs://<some path>/hudi_debug_mor_no_bucket
Found 10 items
-rw-r--r-- 3 gdugarov supergroup 843 2025-01-09 17:32 hdfs://<some
path>/hudi_debug_mor_no_bucket/.acc9cb5b-d3b7-48af-8c85-f7bc07fa0dc1_20250109173254065.log.1_1-8-0
-rw-r--r-- 3 gdugarov supergroup 845 2025-01-09 17:32 hdfs://<some
path>/hudi_debug_mor_no_bucket/.c0d6fc84-981e-46a8-a947-69f71e649b0f-0_20250109173254065.log.1_4-8-0
drwxr-xr-x - gdugarov supergroup 0 2025-01-09 17:36 hdfs://<some
path>/hudi_debug_mor_no_bucket/.hoodie
-rw-r--r-- 3 gdugarov supergroup 96 2025-01-09 17:32 hdfs://<some
path>/hudi_debug_mor_no_bucket/.hoodie_partition_metadata
-rw-r--r-- 3 gdugarov supergroup 433866 2025-01-09 17:32 hdfs://<some
path>/hudi_debug_mor_no_bucket/433d650d-d15f-412d-a31f-f5e4cab42482-0_7-8-0_20250109173159352.parquet
-rw-r--r-- 3 gdugarov supergroup 433863 2025-01-09 17:32 hdfs://<some
path>/hudi_debug_mor_no_bucket/512fbdf5-5c6c-47a5-afce-60138583287e-0_5-8-0_20250109173159352.parquet
-rw-r--r-- 3 gdugarov supergroup 433868 2025-01-09 17:32 hdfs://<some
path>/hudi_debug_mor_no_bucket/749a3c2c-5629-423c-a040-bab452c1e8d4-0_6-8-0_20250109173205420.parquet
-rw-r--r-- 3 gdugarov supergroup 433866 2025-01-09 17:32 hdfs://<some
path>/hudi_debug_mor_no_bucket/9f637562-ec60-4abd-8c63-0e663ec420a9-0_4-8-0_20250109173159352.parquet
-rw-r--r-- 3 gdugarov supergroup 433867 2025-01-09 17:32 hdfs://<some
path>/hudi_debug_mor_no_bucket/c0d6fc84-981e-46a8-a947-69f71e649b0f-0_6-8-0_20250109173159352.parquet
-rw-r--r-- 3 gdugarov supergroup 433869 2025-01-09 17:32 hdfs://<some
path>/hudi_debug_mor_no_bucket/e2f33aac-8e5d-4437-9c02-8e664e6bcd97-0_7-8-0_20250109173205420.parquet
```
Result of `SELECT * FROM hudi_debug_mor_no_bucket ORDER BY id;` in Spark:
```text
spark-sql (default_database)> SELECT * FROM hudi_debug_mor_no_bucket ORDER
BY id;
20250109173159352 20250109173159352_4_3 1
9f637562-ec60-4abd-8c63-0e663ec420a9-0_4-8-0_20250109173159352.parquet 1
100 aaa
20250109173159352 20250109173159352_5_1 2
512fbdf5-5c6c-47a5-afce-60138583287e-0_5-8-0_20250109173159352.parquet 2
200 bbb
20250109173159352 20250109173159352_6_2 3
c0d6fc84-981e-46a8-a947-69f71e649b0f-0_6-8-0_20250109173159352.parquet 3
300 ccc
20250109173159352 20250109173159352_7_4 4
433d650d-d15f-412d-a31f-f5e4cab42482-0_7-8-0_20250109173159352.parquet 4
400 ddd
20250109173254065 20250109173254065_4_2 5
c0d6fc84-981e-46a8-a947-69f71e649b0f-0 5
500 new value
20250109173205420 20250109173205420_6_5 5
749a3c2c-5629-423c-a040-bab452c1e8d4-0_6-8-0_20250109173205420.parquet 5
500 eee
20250109173254065 20250109173254065_1_1 6
acc9cb5b-d3b7-48af-8c85-f7bc07fa0dc1 6
600 new value
20250109173205420 20250109173205420_7_6 6
e2f33aac-8e5d-4437-9c02-8e664e6bcd97-0_7-8-0_20250109173205420.parquet 6
600 fff
Time taken: 4.605 seconds, Fetched 8 row(s)
```
We get two record for `id=5` and `id=6`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]