vkhoroshko opened a new issue, #10622: URL: https://github.com/apache/hudi/issues/10622
**To Reproduce**

Steps to reproduce the behavior:

1. Use Flink SQL with the file below.

**Current behavior**

A separate parquet file is produced with every Flink commit (during checkpointing).

**Expected behavior**

Data is appended to existing parquet file(s) until the max file size threshold is met.

**Environment Description**

* Hudi version : 0.14.1
* Flink version : 1.17.1
* Storage (HDFS/S3/GCS..) : File System
* Running on Docker? (yes/no) : yes

**Additional context**

The expectation (as described in the Apache Hudi docs - https://hudi.apache.org/docs/file_sizing#auto-sizing-during-writes) is that with every Flink commit (every minute), the accumulated records are written into one of the existing parquet files until the parquet max file size threshold is met (5 MB in the example below). However, every commit instead produces a separate parquet file (~400 KB); these files accumulate and are never merged. Please help.

SQL file:

```sql
SET 'parallelism.default' = '1';
SET 'execution.checkpointing.interval' = '1m';

CREATE TABLE datagen (
  id INT NOT NULL PRIMARY KEY NOT ENFORCED,
  data STRING
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '5'
);

CREATE TABLE hudi_tbl (
  id INT NOT NULL PRIMARY KEY NOT ENFORCED,
  data STRING
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///opt/hudi',
  'table.type' = 'COPY_ON_WRITE',
  'write.parquet.block.size' = '1',
  'write.operation' = 'insert',
  'write.parquet.max.file.size' = '5'
);

INSERT INTO hudi_tbl SELECT * FROM datagen;
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
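Editor's note (not part of the original report): the Hudi Flink options reference lists a separate switch, `write.insert.cluster`, described as controlling whether small files are merged for `insert`-mode writes on COPY_ON_WRITE tables (it defaults to `false`). A hedged sketch of the table DDL with that option enabled — assuming it is available and behaves as documented in this Hudi version — would look like:

```sql
-- Hypothetical variant of the reporter's DDL; the only change from the
-- original is the added 'write.insert.cluster' option, which (per the Hudi
-- Flink configuration docs) asks insert-mode writes to merge small files.
CREATE TABLE hudi_tbl (
  id INT NOT NULL PRIMARY KEY NOT ENFORCED,
  data STRING
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///opt/hudi',
  'table.type' = 'COPY_ON_WRITE',
  'write.operation' = 'insert',
  'write.insert.cluster' = 'true',          -- assumption: merges small files on insert
  'write.parquet.max.file.size' = '5',      -- MB, as in the original report
  'write.parquet.block.size' = '1'
);
```

This is a sketch under stated assumptions, not a confirmed fix for the behavior reported above.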
