aokolnychyi opened a new issue #1598:
URL: https://github.com/apache/iceberg/issues/1598


   ```
   CALL catalog.schema.rewrite_manifests(
     namespace => 'namespace_name', -- required
     table => 'table_name', -- required
     min_manifest_size => 0.5 * target manifest size, -- optional
     max_manifest_size => 1.5 * target manifest size, -- optional
     min_num_manifests_to_rewrite => 10, -- optional
     min_clustering_ratio => 0.75 -- optional
   )
   ```
   
   The command can return the produced snapshot id, the number of deleted and 
added manifests, and the number of records whose metadata was rewritten.
   
   It can work as follows:
   
   - Iterate through the list of manifests and identify the manifest files whose 
size is not optimal. The target manifest size is stored in table properties, 
and the stored procedure can accept the allowed deviations from it (with some 
default values). 
   - Analyze the clustering of metadata entries within the optimally sized 
manifests. We need to find the non-overlapping manifests and compute the total 
number of entries in them, then compare that number to the total number of 
entries in all manifests. The ratio indicates how well the metadata is 
clustered. We can check whether manifests overlap based on the min/max stats 
for partition columns.
   - If clustering is bad, we should rewrite all metadata. Rewriting all 
metadata is relatively cheap even for tables with millions of files if snapshot 
id inheritance is enabled.
   - If clustering is OK, we should look only at the manifests whose size is 
not optimal.
       - If the number of manifests that are too small or too big is larger 
than the threshold, we should rewrite those manifests.
       - If the number of manifests that are too small or too big is smaller 
than or equal to the threshold, nothing should be done, as the clustering is OK 
and we don't have enough manifests to rewrite.
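
The steps above could be sketched roughly as follows. This is only an illustration of the proposed decision logic, not Iceberg code: the `Manifest` class, `clustering_ratio`, and `plan_rewrite` are hypothetical names, each manifest is assumed to carry min/max bounds for a single comparable partition column, and overlap is assumed to be checked only among the optimally sized manifests while the denominator counts entries in all manifests. The parameter names mirror the procedure arguments, with the size deviations expressed as factors of the target size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Manifest:
    # Hypothetical, simplified view of a manifest file.
    path: str
    size_bytes: int
    num_entries: int
    lower: int  # min partition value (single column, assumed for simplicity)
    upper: int  # max partition value


def overlaps(a: Manifest, b: Manifest) -> bool:
    # Two closed ranges overlap if each starts before the other ends.
    return a.lower <= b.upper and b.lower <= a.upper


def clustering_ratio(well_sized: List[Manifest], all_manifests: List[Manifest]) -> float:
    # Entries in well-sized manifests that overlap no other well-sized
    # manifest, divided by the entries in all manifests.
    total = sum(m.num_entries for m in all_manifests)
    if total == 0:
        return 1.0
    # Pairwise check is O(n^2); fine for a sketch.
    isolated = sum(
        m.num_entries
        for m in well_sized
        if not any(overlaps(m, o) for o in well_sized if o is not m)
    )
    return isolated / total


def plan_rewrite(
    manifests: List[Manifest],
    target_size: int,
    min_size_factor: float = 0.5,
    max_size_factor: float = 1.5,
    min_num_manifests_to_rewrite: int = 10,
    min_clustering_ratio: float = 0.75,
) -> List[Manifest]:
    """Return the manifests that should be rewritten (possibly none)."""
    lo = min_size_factor * target_size
    hi = max_size_factor * target_size
    well_sized = [m for m in manifests if lo <= m.size_bytes <= hi]
    badly_sized = [m for m in manifests if not (lo <= m.size_bytes <= hi)]

    # Bad clustering: rewrite all metadata.
    if clustering_ratio(well_sized, manifests) < min_clustering_ratio:
        return list(manifests)
    # Clustering is OK: rewrite only if enough manifests are badly sized.
    if len(badly_sized) > min_num_manifests_to_rewrite:
        return badly_sized
    return []
```

For example, with ten well-sized, non-overlapping manifests and three tiny ones, `plan_rewrite(..., min_num_manifests_to_rewrite=2)` would select only the three tiny manifests, while the default threshold of 10 would leave the table untouched.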
   

