Re: [PR] [WIP] [HUDI-8796] Silent ignoring of simple bucket index in Flink append mode [hudi]

via GitHub Wed, 08 Jan 2025 20:50:29 -0800


geserdugarov commented on code in PR #12545:
URL: https://github.com/apache/hudi/pull/12545#discussion_r1908142484



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java:
##########
@@ -207,11 +207,30 @@ public static DataStream<Object> append(
       Configuration conf,
       RowType rowType,
       DataStream<RowData> dataStream) {
-    WriteOperatorFactory<RowData> operatorFactory = AppendWriteOperator.getFactory(conf, rowType);
+    boolean isBucketIndex = OptionsResolver.isBucketIndexType(conf);
+    if (isBucketIndex) {

Review Comment:
   @danny0405, I've checked the behavior for a MOR table with the bucket index 
set. My bad, I didn't check all cases. For inserts into MOR, new parquet files 
are created in the buckets on each insert, which is similar to what I tried to 
implement in this PR. Therefore **users should use a MOR table to insert data 
with the bucket index**, and there is no need for the proposed changes.
   
   But I'm worried that I can currently set the bucket index for a COW table 
and insert data, and the **data will be written to parquet files, silently 
ignoring the buckets**. Maybe we should restrict this operation and throw an 
exception with the message:
   "Bucket index is not supported for inserts into COW table. Please, use MOR 
table or upsert operation."
   Or we could at least log a corresponding warning.
   
   What do you think? Is it better to throw an exception or to log a 
warning?
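
   To make the two options concrete, here is a minimal, self-contained sketch of such a check. It is not Hudi code: the class `BucketIndexGuard`, its enums, and the `failFast` flag are hypothetical stand-ins for values that would be resolved from the Flink `Configuration`. Only the message text comes from the proposal above.

```java
import java.util.logging.Logger;

// Hypothetical illustration only; none of these names exist in Hudi.
final class BucketIndexGuard {
    private static final Logger LOG = Logger.getLogger(BucketIndexGuard.class.getName());

    // Stand-ins for the table type and write operation read from the config.
    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }
    enum Operation { INSERT, UPSERT }

    static final String MESSAGE =
        "Bucket index is not supported for inserts into COW table. "
            + "Please, use MOR table or upsert operation.";

    // True only for the combination where buckets would be ignored silently:
    // bucket index + COW table + insert operation.
    static boolean isUnsupported(boolean isBucketIndex, TableType tableType, Operation operation) {
        return isBucketIndex
            && tableType == TableType.COPY_ON_WRITE
            && operation == Operation.INSERT;
    }

    // failFast = true  -> option 1: reject the job with an exception.
    // failFast = false -> option 2: keep the current behavior but log a warning.
    static void validate(boolean isBucketIndex, TableType tableType, Operation operation, boolean failFast) {
        if (!isUnsupported(isBucketIndex, tableType, operation)) {
            return;
        }
        if (failFast) {
            throw new IllegalArgumentException(MESSAGE);
        }
        LOG.warning(MESSAGE);
    }
}
```

   Throwing makes the misconfiguration impossible to miss but breaks existing jobs that already run with this setup, while the warning keeps them running at the cost of being easy to overlook in the logs.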



