wanlce opened a new issue, #8510:
URL: https://github.com/apache/iceberg/issues/8510
### Query engine
- Flink 1.13
- Spark 3.2
- Iceberg 1.2
### Question
The default target file size for compaction in Iceberg is 512 MB. We currently
write data into Iceberg through Flink CDC. Because the checkpoint interval is
5 minutes, a large number of small files are generated. We therefore use a
scheduler to regularly run rewrite_data_files and expire_snapshots, expiring
snapshots older than 3 days.
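For context, a minimal sketch of such a scheduled maintenance job using the Iceberg Spark procedures mentioned above (the catalog/table names and the timestamp are placeholders, not from the original report):

```
-- compact small files toward the table's target file size
CALL catalog.system.rewrite_data_files(table => 'schema.tab');

-- expire snapshots older than a cutoff (here: a placeholder timestamp ~3 days ago)
CALL catalog.system.expire_snapshots(
  table => 'schema.tab',
  older_than => TIMESTAMP '2023-09-01 00:00:00'
);
```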
SQL Query:
```
select count(1) as cnt
from catalog.schema.tab.files
where file_size_in_bytes < 1048576;
```
result:
cnt = 17,000 (1.7w)
Then I manually ran rewrite_data_files to merge the small files, and it
returned: No small files need to be merged.
Question 1:
Why does manually triggering the small-file merge operation not actually
execute any rewrite?
Question 2:
I don't see the corresponding parameters for rewrite_data_files on the
official website. Is there a threshold that controls whether small-file
merging is actually executed?
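As a hedged sketch related to Question 2 (option names as documented for the Iceberg Spark `rewrite_data_files` procedure; values here are illustrative, not defaults from the original report), the procedure accepts options that gate whether a file group is rewritten at all, such as `min-input-files`:

```
CALL catalog.system.rewrite_data_files(
  table => 'schema.tab',
  options => map(
    'min-input-files', '2',                -- rewrite a group even when it contains few files
    'target-file-size-bytes', '536870912'  -- 512 MB compaction target
  )
);
```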
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]