[
https://issues.apache.org/jira/browse/IMPALA-12337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759628#comment-17759628
]
Zoltán Borók-Nagy commented on IMPALA-12337:
--------------------------------------------
Thank you [~xiabaike] for your interest in implementing this feature.
Other engines in the Iceberg ecosystem use the VACUUM statement for this, so
we could consider using a similar syntax. E.g. Dremio's syntax looks
pretty cool IMHO:
[https://docs.dremio.com/current/reference/sql/commands/apache-iceberg-tables/vacuum-table]
I also want to point out two cases that need to be handled very carefully:
* S3FileIO vs HadoopFileIO
** Impala uses HadoopFileIO while other engines might use S3FileIO
** S3FileIO uses prefix s3://, while HadoopFileIO uses prefix s3a://
** It means that in the Iceberg metadata layer some files can be listed with
s3:// while others might be listed with s3a:// prefixes
** When Impala does a recursive file listing via HadoopFileIO it will return
all paths with s3a:// prefixes
** So please make sure we handle this situation correctly, or at least raise
an error when we have files with s3:// prefixes
* LocationProviders
** Someone might use non-default location providers, e.g.
[https://github.com/apache/iceberg/blob/16dedd469bc9ec64fad3a40d3a9926edc8cc65f6/core/src/main/java/org/apache/iceberg/LocationProviders.java#L108]
** In this case data files are not stored under the table location
** The only information about these data files is in the Iceberg metadata
layer, and only if the snapshots are not yet expired and the manifest files
are still around
** Though initially I think it's enough to only support tables with location
providers that put data files under the table location
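To make the two cases above concrete, here is a minimal Python sketch. It is
not Impala code and not an Iceberg API: the names normalize_scheme and
find_orphan_files are hypothetical, and treating s3://, s3a:// and s3n:// as
aliases of the same store is an assumption of the sketch. It normalizes the
scheme before comparing the file listing with the paths referenced in the
metadata, and raises an error if a referenced file lives outside the table
location. The older_than filter is left out for brevity.

```python
# Assumption: these schemes all point at the same S3 object store.
S3_SCHEME_ALIASES = {"s3", "s3a", "s3n"}


def normalize_scheme(path, canonical="s3a"):
    """Rewrite an S3-style scheme to the canonical one; leave other paths alone."""
    scheme, sep, rest = path.partition("://")
    if sep and scheme in S3_SCHEME_ALIASES:
        return canonical + "://" + rest
    return path


def find_orphan_files(listed_files, referenced_files, table_location):
    """Return listed files that no metadata entry references.

    listed_files:     paths from a recursive listing of the table location
    referenced_files: paths recorded in the Iceberg metadata layer
    Raises ValueError if a referenced file is not under table_location,
    which is what a non-default location provider could produce.
    """
    table_prefix = normalize_scheme(table_location).rstrip("/") + "/"
    referenced = {normalize_scheme(p) for p in referenced_files}

    outside = sorted(p for p in referenced if not p.startswith(table_prefix))
    if outside:
        raise ValueError("files outside the table location: %s" % outside)

    listed = {normalize_scheme(p) for p in listed_files}
    return sorted(listed - referenced)


if __name__ == "__main__":
    listed = [
        "s3a://bucket/tbl/data/a.parquet",
        "s3a://bucket/tbl/data/b.parquet",
        "s3a://bucket/tbl/data/c.parquet",
    ]
    referenced = [
        "s3://bucket/tbl/data/a.parquet",   # written via S3FileIO
        "s3a://bucket/tbl/data/b.parquet",  # written via HadoopFileIO
    ]
    print(find_orphan_files(listed, referenced, "s3a://bucket/tbl"))
```

Without the normalization step, a plain string comparison would also report
a.parquet as an orphan, although the metadata still references it under the
s3:// prefix.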
> Delete orphan files for Iceberg table
> -------------------------------------
>
> Key: IMPALA-12337
> URL: https://issues.apache.org/jira/browse/IMPALA-12337
> Project: IMPALA
> Issue Type: New Feature
> Components: Catalog, Frontend
> Reporter: Baike Xia
> Assignee: Baike Xia
> Priority: Major
> Labels: impala-iceberg
>
> Removes all files from a table’s data directory that are not linked from
> metadata files and that are older than the value of the older_than parameter.
> Deleting orphan files from time to time is recommended to keep the size of a
> table’s data directory under control.
> {code:java}
> ALTER TABLE test_table EXECUTE remove_orphan_files(older_than = 1431691200)
> {code}
> See the syntax for expire_snapshot: IMPALA-11362