Jiezhi opened a new issue #4190:
URL: https://github.com/apache/hudi/issues/4190


   
   **Describe the problem you faced**
   
   In Hudi 0.9, when using the Flink 1.12.2 SQL client to sink logs into a Hudi **COW** table in **insert** mode, the small files were merged into a few Parquet files.
   
   With Hudi 0.10, the same code produces lots of small Parquet files, and turning on the **write.insert.cluster** option mentioned in the [doc](https://hudi.apache.org/docs/next/flink-quick-start-guide#options-3) has no effect.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Hudi table options:
   ```sql
   ...
   WITH (
     'connector' = 'hudi',
     'path' = 'hdfs://xxx:8020/hudi/xxx',
     'write.precombine.field' = 'time',
     'write.operation' = 'insert',
     'write.insert.cluster' = 'true',
     'write.insert.deduplicate' = 'true',
     'hoodie.datasource.write.recordkey.field' = 'distinct_id,time,event,_track_id',
     'read.streaming.enabled' = 'true',  -- enables streaming read
     'read.streaming.start-commit' = '20211001000000', -- start commit instant time
     'read.streaming.check-interval' = '60', -- check interval for finding new source commits, default 60s
     'table.type' = 'COPY_ON_WRITE', -- table type; COPY_ON_WRITE is the default
     'hive_sync.enable' = 'true',     -- Required. Enables Hive synchronization
     'hive_sync.mode' = 'hms',        -- Required. Sets the Hive sync mode to hms; the default is jdbc
     'hive_sync.table' = 'xxx',       -- Required. Hive table name
     'hive_sync.db' = 'xxx',
     'hive_sync.metastore.uris' = 'thrift://xxx9083' -- Required. The port must be set in hive-site.xml
   );
   ```
   
   
   **Expected behavior**
   
   Small files get merged.
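
To put a number on "lots of small parquet files", a rough count can be taken before and after a few commits. The following is a minimal sketch, not part of Hudi: it assumes the table directory is reachable as a local path (for example after copying it out of HDFS), and both the function name `count_small_parquet_files` and the 100 MB threshold are illustrative choices.

```python
import os


def count_small_parquet_files(table_path, threshold_bytes=100 * 1024 * 1024):
    """Return (small, total) counts of .parquet files under table_path.

    A file counts as "small" when its size is below threshold_bytes
    (an illustrative default, not a Hudi setting).
    """
    small = 0
    total = 0
    for root, dirs, files in os.walk(table_path):
        # Skip the .hoodie metadata directory so only data files are counted.
        dirs[:] = [d for d in dirs if d != ".hoodie"]
        for name in files:
            if not name.endswith(".parquet"):
                continue
            total += 1
            if os.path.getsize(os.path.join(root, name)) < threshold_bytes:
                small += 1
    return small, total
```

Calling `count_small_parquet_files("/path/to/table")` on a copy of the table written by each Hudi version would show whether the number of small files keeps growing instead of being merged.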
   
   **Environment Description**
   
   * Hudi version : 0.10-snapshot/0.11-snapshot
   
   * Spark version :NA
   
   * Hive version :2.1.1
   
   * Hadoop version :3.0
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no 
   
   * Flink: 1.13.2
   
   **Additional context**
   
   NA
   
   **Stacktrace**
   
   NA
   
   



