rdblue commented on issue #2900: URL: https://github.com/apache/iceberg/issues/2900#issuecomment-907871931
The orphan files check shouldn't make a difference here. The only action/operation you need to run is `expireSnapshots` to remove old snapshots that are taking up memory (see the first sketch below). Orphan files are files that are not referenced by table metadata, while snapshot expiration cleans up data files that were logically deleted, so under normal circumstances expiration should not leak data files for the orphan file cleanup to find.

If I remember correctly, the problem after that was that you removed too many snapshots and could no longer run the downstream job: because the snapshots were gone, Iceberg couldn't tell what had happened between the last snapshot that was processed and the oldest snapshot remaining in history.

Iceberg produces incremental changes by reading each snapshot and filtering down to the manifests and data files that were added in that snapshot. Those files are what Flink processes when reading incremental changes from a table. But if you don't have all of the snapshots since the last time your Flink job ran, then even though the data is still in the table, Iceberg can't determine which files it needs to process, because files could have been changed or rewritten before the current snapshot (see the second sketch below).
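To make the first point concrete, here is a minimal sketch of snapshot expiration using the core `Table` API. The `table` handle, the 3-day threshold, and the retention count of 100 are illustrative assumptions; the point is to keep enough history to cover the longest outage of any incremental consumer.

```java
import java.util.concurrent.TimeUnit;
import org.apache.iceberg.Table;

public class ExpireExample {
  // Expire snapshots older than three days, but always retain the last 100
  // so a downstream incremental consumer (e.g. a Flink job) can still find
  // the snapshot it last processed. Numbers here are illustrative.
  public static void expire(Table table) {
    long threeDaysAgo = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);
    table.expireSnapshots()
        .expireOlderThan(threeDaysAgo) // drop snapshots older than this timestamp
        .retainLast(100)               // but never keep fewer than the last 100
        .commit();                     // removes expired snapshots and unreachable files
  }
}
```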
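And a sketch of the incremental-read side, using the core scan API's `appendsBetween` (the same snapshot-walking idea that backs Flink's incremental consumption). The `table` handle and `lastProcessedSnapshotId` are assumed inputs. If `lastProcessedSnapshotId` has been expired, Iceberg can no longer walk the snapshot history from that point and planning fails, which is exactly the failure mode described above.

```java
import org.apache.iceberg.CombinedScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

public class IncrementalScanExample {
  // Plan a scan over only the files appended between the last processed
  // snapshot and the current one. Throws if the snapshot range is no longer
  // in the table's history (e.g. after overly aggressive expiration).
  public static void planAppends(Table table, long lastProcessedSnapshotId) {
    long currentSnapshotId = table.currentSnapshot().snapshotId();
    try (CloseableIterable<CombinedScanTask> tasks =
             table.newScan()
                 .appendsBetween(lastProcessedSnapshotId, currentSnapshotId)
                 .planTasks()) {
      tasks.forEach(task -> System.out.println("planned task: " + task));
    } catch (Exception e) {
      // close() can throw IOException; planning errors surface as runtime exceptions
      throw new RuntimeException(e);
    }
  }
}
```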
