vkhoroshko opened a new issue, #10622:
URL: https://github.com/apache/hudi/issues/10622

   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Use Flink SQL with the file below.
   
   **Current behavior**
   A separate Parquet file is produced with every Flink commit (at each 
checkpoint).
   
   **Expected behavior**
   Data is appended to existing Parquet file(s) until the max file size threshold is reached.
   
   
   **Environment Description**
   
   * Hudi version :
   0.14.1
   
   * Flink version :
   1.17.1
   
   * Storage (HDFS/S3/GCS..) :
   File System
   
   * Running on Docker? (yes/no) :
   yes
   
   
   **Additional context**
   The expectation (as described in the Apache Hudi docs: 
https://hudi.apache.org/docs/file_sizing#auto-sizing-during-writes) is that 
with every Flink commit (every minute), the accumulated records are written 
into one of the existing Parquet files until the Parquet max file size 
threshold is reached (5 MB in the example below).
   However, every commit instead produces a separate Parquet file (~400 KB 
each). These files accumulate and are never merged. Please help.
   
   SQL file:
   ```
   SET 'parallelism.default' = '1';
   SET 'execution.checkpointing.interval' = '1m';
   
   CREATE TABLE datagen
   (
       id   INT NOT NULL PRIMARY KEY NOT ENFORCED,
       data STRING
   ) WITH (
         'connector' = 'datagen',
         'rows-per-second' = '5'
   );
   
   CREATE TABLE hudi_tbl
   (
       id   INT NOT NULL PRIMARY KEY NOT ENFORCED,
       data STRING
   ) WITH (
         'connector' = 'hudi',
         'path' = 'file:///opt/hudi',
         'table.type' = 'COPY_ON_WRITE',
         'write.parquet.block.size' = '1',
         'write.operation' = 'insert',
         'write.parquet.max.file.size' = '5'
   );
   
   INSERT INTO hudi_tbl SELECT * from datagen;
   ```
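
To make the expectation concrete, here is a rough sketch of the arithmetic behind it. It assumes that `write.parquet.max.file.size` is expressed in MB and that each commit contributes roughly the ~400 KB observed above; the function names are only illustrative and are not part of Hudi or Flink. The real number would differ somewhat, since merged data may compress differently than separate small files.

```python
import math

# Values taken from the SQL file and the observations in this issue.
ROWS_PER_SECOND = 5          # 'rows-per-second' on the datagen table
CHECKPOINT_INTERVAL_S = 60   # 'execution.checkpointing.interval' = '1m'
MAX_FILE_SIZE_MB = 5         # 'write.parquet.max.file.size' (assumed to be in MB)
OBSERVED_FILE_SIZE_KB = 400  # approximate size of each file produced per commit


def rows_per_commit(rows_per_second: int, interval_s: int) -> int:
    """Records generated between two checkpoints (one Flink commit)."""
    return rows_per_second * interval_s


def commits_to_fill_file(max_file_size_mb: float, per_commit_kb: float) -> int:
    """Whole commits that would fit into one file before the size threshold."""
    return math.floor(max_file_size_mb * 1024 / per_commit_kb)


if __name__ == "__main__":
    print("rows per commit:", rows_per_commit(ROWS_PER_SECOND, CHECKPOINT_INTERVAL_S))
    print("commits per file:", commits_to_fill_file(MAX_FILE_SIZE_MB, OBSERVED_FILE_SIZE_KB))
```

Under these assumptions, about a dozen consecutive commits would be expected to land in the same Parquet file before a new one is started, instead of one new file per commit.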
   
   
   

