aokolnychyi opened a new issue #1598:
URL: https://github.com/apache/iceberg/issues/1598
```
CALL catalog.schema.rewrite_manifests(
  namespace => 'namespace_name', -- required
  table => 'table_name', -- required
  min_manifest_size => 0.5 * target manifest size, -- optional
  max_manifest_size => 1.5 * target manifest size, -- optional
  min_num_manifests_to_rewrite => 10, -- optional
  min_clustering_ratio => 0.75 -- optional
)
```
The command can return the produced snapshot id, the number of deleted and
added manifests, and the number of records whose metadata was rewritten.
It can work as follows:
- Iterate through the list of manifests and find out which manifest files are
not optimal from the size perspective. The target manifest size is stored in
table properties, and the stored procedure can accept allowed deviations (with
some default value).
- Analyze the clustering of metadata entries within optimal manifests. We
need to identify non-overlapping manifests and compute the total number of
entries in them. Then we should compare that number to the total number of
entries in all manifests. This ratio gives us an idea of how well our metadata
is clustered. We can check whether manifests overlap based on min/max stats for
partition columns.
  - If clustering is bad, we should rewrite all metadata. Rewriting all
  metadata is relatively cheap even for tables with millions of files if
  snapshot id inheritance is enabled.
  - If clustering is OK, we should look only at the manifests that are not
  optimal from the size perspective.
    - If the number of manifests that are too small or too big is larger than
    the threshold, we should rewrite those manifests.
    - If the number of manifests that are too small or too big is smaller than
    or equal to the threshold, nothing should be done, as the clustering is OK
    and we don't have enough manifests to rewrite.
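
The decision logic above can be sketched roughly as follows. This is only an illustration of one reading of the steps, not Iceberg code: the `Manifest` class and all function names are hypothetical, a single integer partition column stands in for the min/max partition stats, and "bad clustering" is assumed to mean a ratio below `min_clustering_ratio`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Manifest:
    """Hypothetical summary of one manifest file (not an Iceberg class)."""
    path: str
    size_bytes: int
    num_entries: int
    partition_min: int  # assumed: lower bound of a single partition column
    partition_max: int  # assumed: upper bound of the same column


def is_size_optimal(m: Manifest, target_size: int,
                    min_factor: float = 0.5, max_factor: float = 1.5) -> bool:
    # Allowed deviation around the target size, matching the proposed defaults.
    return min_factor * target_size <= m.size_bytes <= max_factor * target_size


def overlaps(a: Manifest, b: Manifest) -> bool:
    # Two closed ranges overlap when each one starts before the other ends.
    return a.partition_min <= b.partition_max and b.partition_min <= a.partition_max


def clustering_ratio(optimal: List[Manifest], all_manifests: List[Manifest]) -> float:
    # Entries in optimal manifests that overlap no other optimal manifest,
    # divided by the entries in all manifests.
    total = sum(m.num_entries for m in all_manifests)
    if total == 0:
        return 1.0  # assumption: an empty table counts as well clustered
    isolated = sum(
        m.num_entries
        for m in optimal
        if not any(overlaps(m, other) for other in optimal if other is not m)
    )
    return isolated / total


def select_manifests_to_rewrite(manifests: List[Manifest], target_size: int,
                                min_num_manifests_to_rewrite: int = 10,
                                min_clustering_ratio: float = 0.75) -> List[Manifest]:
    optimal = [m for m in manifests if is_size_optimal(m, target_size)]
    non_optimal = [m for m in manifests if not is_size_optimal(m, target_size)]
    if clustering_ratio(optimal, manifests) < min_clustering_ratio:
        return list(manifests)  # clustering is bad: rewrite all metadata
    if len(non_optimal) > min_num_manifests_to_rewrite:
        return non_optimal      # clustering is OK: fix only badly sized manifests
    return []                   # not enough manifests to justify a rewrite
```

With this reading, a table whose manifests are all close to the target size and cover disjoint partition ranges yields an empty list, while a table whose manifests all cover the same partition range is rewritten in full.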