[ 
https://issues.apache.org/jira/browse/IMPALA-12337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759628#comment-17759628
 ] 

Zoltán Borók-Nagy commented on IMPALA-12337:
--------------------------------------------

Thank you [~xiabaike] for your interest in implementing this nice feature.

Other engines in the Iceberg ecosystem use the VACUUM statement for this, so 
we should probably consider a similar syntax. E.g. Dremio's syntax looks 
pretty cool IMHO: 
[https://docs.dremio.com/current/reference/sql/commands/apache-iceberg-tables/vacuum-table]

I also want to point out two cases that need to be handled very carefully:
 * S3FileIO vs HadoopFileIO
 ** Impala uses HadoopFileIO while other engines might use S3FileIO
 ** S3FileIO uses prefix s3://, while HadoopFileIO uses prefix s3a://
 ** It means in the Iceberg metadata layer some files can be listed with s3:// 
while others might be listed with s3a:// prefixes
 ** When Impala does a recursive file listing via HadoopFileIO it will return 
all paths with s3a:// prefixes
 ** So please make sure we handle this situation correctly, or at least raise 
an error when we have files with s3:// prefixes
 * LocationProviders
 ** Someone might use non-default location providers, e.g. 
[https://github.com/apache/iceberg/blob/16dedd469bc9ec64fad3a40d3a9926edc8cc65f6/core/src/main/java/org/apache/iceberg/LocationProviders.java#L108]
 ** In this case data files are not stored under the table location
 ** The only information about these data files is in the Iceberg metadata 
layer, and only while the snapshots are not yet expired and the manifest files 
are still around
 ** Though initially I think it's enough to only support tables with location 
providers that put data files under the table location
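
To illustrate the two cases above, here is a minimal sketch in Python. It is not Impala or Iceberg code: the names normalize, find_orphans and S3_SCHEMES, and the idea of passing the listing as a dict of path to modification time, are all assumptions made for the example. It treats s3://, s3a:// and s3n:// as the same store before comparing paths, and it raises an error when the metadata references a file outside the table location, which is what a non-default location provider would produce.

```python
# Assumption: these schemes all point at the same S3 object store.
S3_SCHEMES = {"s3", "s3a", "s3n"}


def normalize(path):
    """Rewrite s3:// and s3n:// paths to s3a:// so both FileIO styles compare equal."""
    scheme, sep, rest = path.partition("://")
    if sep and scheme in S3_SCHEMES:
        return "s3a://" + rest
    return path


def find_orphans(table_location, referenced, listed, older_than):
    """Return listed files that the metadata does not reference and that are
    older than older_than.

    referenced: paths taken from the Iceberg metadata layer (hypothetical input)
    listed: dict of path -> modification time in epoch seconds, as a recursive
            listing of the table location might return (hypothetical input)
    """
    root = normalize(table_location).rstrip("/") + "/"
    known = {normalize(p) for p in referenced}

    # Files outside the table location cannot be matched against the listing,
    # so refuse to continue instead of guessing.
    outside = sorted(p for p in known if not p.startswith(root))
    if outside:
        raise ValueError(f"files outside table location: {outside}")

    return sorted(
        p for p, mtime in listed.items()
        if normalize(p) not in known and mtime < older_than
    )
```

Without the normalization step, a file recorded as s3://bucket/tbl/data/a.parquet in the metadata and listed as s3a://bucket/tbl/data/a.parquet would look unreferenced and would be deleted by mistake.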

> Delete orphan files for Iceberg table
> -------------------------------------
>
>                 Key: IMPALA-12337
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12337
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Catalog, Frontend
>            Reporter: Baike Xia
>            Assignee: Baike Xia
>            Priority: Major
>              Labels: impala-iceberg
>
> Removes all files from a table’s data directory that are not linked from 
> metadata files and that are older than the value of the older_than parameter. 
> Deleting orphan files from time to time is recommended to keep the size of a 
> table’s data directory under control.
> {code:java}
> ALTER TABLE test_table EXECUTE remove_orphan_files(older_than =  1431691200) 
> {code}
> See the syntax for expire_snapshot: IMPALA-11362



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

