geserdugarov commented on code in PR #12545:
URL: https://github.com/apache/hudi/pull/12545#discussion_r1908303305


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java:
##########
@@ -207,11 +207,30 @@ public static DataStream<Object> append(
       Configuration conf,
       RowType rowType,
       DataStream<RowData> dataStream) {
-    WriteOperatorFactory<RowData> operatorFactory = 
AppendWriteOperator.getFactory(conf, rowType);
+    boolean isBucketIndex = OptionsResolver.isBucketIndexType(conf);
+    if (isBucketIndex) {

Review Comment:
   There is a major problem with proposed changes.
   
   ### With proposed in this MR changes
   **Switching from `insert` some data to `upsert` of another bunch of data 
wouldn't work** due to:
   
https://github.com/apache/hudi/blob/e67d0aa71e2253a5b5cf95028cdf95482ffeca6a/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bucket/BucketStreamWriteFunction.java#L181-L184
   
   We will get exception if we will try it.
   
   ### Behavior without changes
   
   Results are also questionable due to data duplication.
   
   Paste using `insert`:
   ```SQL
   INSERT INTO hudi_debug_mor_no_bucket VALUES 
       (1,100,'aaa'),
       (2,200,'bbb'),
       (3,300,'ccc'),
       (4,400,'ddd');
   INSERT INTO hudi_debug_mor_no_bucket VALUES 
       (5,500,'eee'),
       (6,600,'fff');
   ```
   Switch to `upsert` and run:
   ```SQL
   INSERT INTO hudi_debug_mor_no_bucket VALUES 
       (5,500,'new value'),
       (6,600,'new value');
   ```
   
   In a results, files in HDFS:
   ```Text
   hdfs dfs -ls hdfs://<some path>/hudi_debug_mor_no_bucket
   Found 10 items
   -rw-r--r--   3 gdugarov supergroup        843 2025-01-09 17:32 hdfs://<some 
path>/hudi_debug_mor_no_bucket/.acc9cb5b-d3b7-48af-8c85-f7bc07fa0dc1_20250109173254065.log.1_1-8-0
   -rw-r--r--   3 gdugarov supergroup        845 2025-01-09 17:32 hdfs://<some 
path>/hudi_debug_mor_no_bucket/.c0d6fc84-981e-46a8-a947-69f71e649b0f-0_20250109173254065.log.1_4-8-0
   drwxr-xr-x   - gdugarov supergroup          0 2025-01-09 17:36 hdfs://<some 
path>/hudi_debug_mor_no_bucket/.hoodie
   -rw-r--r--   3 gdugarov supergroup         96 2025-01-09 17:32 hdfs://<some 
path>/hudi_debug_mor_no_bucket/.hoodie_partition_metadata
   -rw-r--r--   3 gdugarov supergroup     433866 2025-01-09 17:32 hdfs://<some 
path>/hudi_debug_mor_no_bucket/433d650d-d15f-412d-a31f-f5e4cab42482-0_7-8-0_20250109173159352.parquet
   -rw-r--r--   3 gdugarov supergroup     433863 2025-01-09 17:32 hdfs://<some 
path>/hudi_debug_mor_no_bucket/512fbdf5-5c6c-47a5-afce-60138583287e-0_5-8-0_20250109173159352.parquet
   -rw-r--r--   3 gdugarov supergroup     433868 2025-01-09 17:32 hdfs://<some 
path>/hudi_debug_mor_no_bucket/749a3c2c-5629-423c-a040-bab452c1e8d4-0_6-8-0_20250109173205420.parquet
   -rw-r--r--   3 gdugarov supergroup     433866 2025-01-09 17:32 hdfs://<some 
path>/hudi_debug_mor_no_bucket/9f637562-ec60-4abd-8c63-0e663ec420a9-0_4-8-0_20250109173159352.parquet
   -rw-r--r--   3 gdugarov supergroup     433867 2025-01-09 17:32 hdfs://<some 
path>/hudi_debug_mor_no_bucket/c0d6fc84-981e-46a8-a947-69f71e649b0f-0_6-8-0_20250109173159352.parquet
   -rw-r--r--   3 gdugarov supergroup     433869 2025-01-09 17:32 hdfs://<some 
path>/hudi_debug_mor_no_bucket/e2f33aac-8e5d-4437-9c02-8e664e6bcd97-0_7-8-0_20250109173205420.parquet
   ```
   Result of `SELECT * FROM hudi_debug_mor_no_bucket ORDER BY id;` in Spark:
   ```text
   spark-sql (default_database)> SELECT * FROM hudi_debug_mor_no_bucket ORDER 
BY id;
   20250109173159352    20250109173159352_4_3   1               
9f637562-ec60-4abd-8c63-0e663ec420a9-0_4-8-0_20250109173159352.parquet  1       
100     aaa
   20250109173159352    20250109173159352_5_1   2               
512fbdf5-5c6c-47a5-afce-60138583287e-0_5-8-0_20250109173159352.parquet  2       
200     bbb
   20250109173159352    20250109173159352_6_2   3               
c0d6fc84-981e-46a8-a947-69f71e649b0f-0_6-8-0_20250109173159352.parquet  3       
300     ccc
   20250109173159352    20250109173159352_7_4   4               
433d650d-d15f-412d-a31f-f5e4cab42482-0_7-8-0_20250109173159352.parquet  4       
400     ddd
   20250109173254065    20250109173254065_4_2   5               
c0d6fc84-981e-46a8-a947-69f71e649b0f-0                                  5       
500     new value
   20250109173205420    20250109173205420_6_5   5               
749a3c2c-5629-423c-a040-bab452c1e8d4-0_6-8-0_20250109173205420.parquet  5       
500     eee
   20250109173254065    20250109173254065_1_1   6               
acc9cb5b-d3b7-48af-8c85-f7bc07fa0dc1                                    6       
600     new value
   20250109173205420    20250109173205420_7_6   6               
e2f33aac-8e5d-4437-9c02-8e664e6bcd97-0_7-8-0_20250109173205420.parquet  6       
600     fff
   Time taken: 4.605 seconds, Fetched 8 row(s)
   ```
   
   We get two record for `id=5` and `id=6`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to