Jiezhi opened a new issue #4190: URL: https://github.com/apache/hudi/issues/4190
**Describe the problem you faced**

In Hudi 0.9, using the Flink 1.12.2 SQL client to sink logs into a Hudi **cow** table in **insert** mode, the small files would be merged into a few parquet files. With Hudi 0.10, the same code produces lots of small parquet files, and turning on the **write.insert.cluster** option mentioned in the [doc](https://hudi.apache.org/docs/next/flink-quick-start-guide#options-3) has no effect.

**To Reproduce**

Steps to reproduce the behavior:

1. Hudi table options:

```sql
... WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://xxx:8020/hudi/xxx',
  'write.precombine.field' = 'time',
  'write.operation' = 'insert',
  'write.insert.cluster' = 'true',
  'write.insert.deduplicate' = 'true',
  'hoodie.datasource.write.recordkey.field' = 'distinct_id,time,event,_track_id',
  'read.streaming.enabled' = 'true',                 -- this option enables streaming read
  'read.streaming.start-commit' = '20211001000000',  -- specifies the start commit instant time
  'read.streaming.check-interval' = '60',            -- specifies the check interval for finding new source commits, default 60s
  'table.type' = 'COPY_ON_WRITE',                    -- COPY_ON_WRITE is the default; MERGE_ON_READ is the other option
  'hive_sync.enable' = 'true',                       -- required, enables hive synchronization
  'hive_sync.mode' = 'hms',                          -- required, sets hive sync mode to hms, default jdbc
  'hive_sync.table' = 'xxx',                         -- required, hive table name
  'hive_sync.db' = 'xxx',
  'hive_sync.metastore.uris' = 'thrift://xxx9083'    -- required, the port needs to be set in hive-site.xml
);
```

**Expected behavior**

Small files get merged.

**Environment Description**

* Hudi version : 0.10-snapshot/0.11-snapshot
* Spark version : NA
* Hive version : 2.1.1
* Hadoop version : 3.0
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no
* Flink version : 1.13.2

**Additional context**

NA

**Stacktrace**

NA

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
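To quantify the small-file regression described above, a minimal diagnostic sketch (not part of Hudi; the paths, the 100 MB default threshold, and the function name are illustrative assumptions) that counts parquet data files below a size threshold in a local copy of the table directory:

```python
import os


def count_small_parquet_files(table_path, threshold_bytes=100 * 1024 * 1024):
    """Count parquet data files under `threshold_bytes` in a table directory.

    The 100 MB threshold is an assumption for illustration; adjust it to
    your configured target file size.
    """
    small, total = 0, 0
    for root, _dirs, files in os.walk(table_path):
        for name in files:
            if name.endswith(".parquet"):
                total += 1
                if os.path.getsize(os.path.join(root, name)) < threshold_bytes:
                    small += 1
    return small, total
```

Running this against a local copy of the table path (e.g. fetched with `hdfs dfs -get`) before and after a few commits would show whether the writer is merging small files or accumulating them.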
