rdblue commented on issue #2900:
URL: https://github.com/apache/iceberg/issues/2900#issuecomment-907871931


   The orphan files check shouldn't make a difference here. The only 
maintenance action you need to run is `expireSnapshots` to remove old 
snapshots that are taking up space. Orphan files are files that are not 
referenced by table metadata, while snapshot expiration cleans up data files 
that were logically deleted, so under normal circumstances expiration should 
not leak data files for the orphan file cleanup to find.
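   
   For reference, here is a minimal sketch of snapshot expiration through the 
core Java `Table` API (the 7-day cutoff is an arbitrary example value, not a 
recommendation):
   
   ```java
   import java.util.concurrent.TimeUnit;
   
   import org.apache.iceberg.Table;
   
   public class ExpireOldSnapshots {
     public static void expire(Table table) {
       // expire snapshots older than 7 days; data files that are no longer
       // referenced by any retained snapshot are deleted as part of expiration
       long cutoffMillis = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
       table.expireSnapshots()
           .expireOlderThan(cutoffMillis)
           .commit();
     }
   }
   ```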
   
   If I remember correctly, the problem after that was that you removed too 
many snapshots and could no longer run the downstream job: once the snapshots 
were gone, Iceberg couldn't tell what had happened between the last snapshot 
that was processed and the oldest snapshot remaining in the table history.
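   
   One way to guard against that is to combine the age cutoff with 
`retainLast`, so a minimum number of recent snapshots always survives 
expiration. A sketch (100 is an arbitrary example; size it to how far behind 
your downstream job can fall):
   
   ```java
   import java.util.concurrent.TimeUnit;
   
   import org.apache.iceberg.Table;
   
   public class SafeExpire {
     public static void expireKeepingHistory(Table table) {
       long cutoffMillis = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
       table.expireSnapshots()
           .expireOlderThan(cutoffMillis)
           // keep at least the last 100 snapshots so an incremental consumer
           // that lags behind can still find its last-processed snapshot
           .retainLast(100)
           .commit();
     }
   }
   ```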
   
   What's happening here is that we get incremental changes by reading each 
snapshot and filtering down to the manifests and data files that were added 
in that snapshot. Those files are what Flink processes when reading 
incremental changes from a table. But if you don't have all of the snapshots 
since the last time your Flink job ran, then although the data is still in 
the table, Iceberg can't determine which files it needs to process, because 
files could have been changed or rewritten in the snapshots that are now 
missing.
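   
   Below is a rough sketch of that incremental planning using the core Java 
scan API. Here `lastProcessedSnapshotId` is an assumed value tracked by the 
consumer (the Flink source keeps this state itself), and the scan cannot 
succeed once the snapshots in that range have been expired:
   
   ```java
   import java.io.IOException;
   
   import org.apache.iceberg.FileScanTask;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.io.CloseableIterable;
   
   public class IncrementalPlan {
     public static void planAppends(Table table, long lastProcessedSnapshotId)
         throws IOException {
       // plan only the files appended between the consumer's last-processed
       // snapshot and the table's current snapshot
       try (CloseableIterable<FileScanTask> tasks =
                table.newScan()
                    .appendsBetween(lastProcessedSnapshotId, table.currentSnapshot().snapshotId())
                    .planFiles()) {
         for (FileScanTask task : tasks) {
           System.out.println(task.file().path());
         }
       }
     }
   }
   ```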

