rdblue commented on issue #4194:
URL: https://github.com/apache/iceberg/issues/4194#issuecomment-1048962292


   @jotarada, do you have any information about the run that deleted the file? 
Were there concurrent writes? And what length of time did you use for 
`older_than`?
   
   The interval you use is important if you have jobs that run for a long time. 
The usual cause when this happens is that the `older_than` timestamp allows 
removing files that haven't been committed yet. For example, if `older_than` is 
set to 3 hours ago and a job writes files for 4 hours, the job may write files 
that get caught as orphans because they're older than the cutoff but not (yet) 
committed to the table.
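   
   For reference, here's a minimal sketch of running the action with an 
explicit cutoff, assuming a running `SparkSession` and a loaded Iceberg 
`Table` (the 24-hour cutoff is just an illustrative value, not a 
recommendation for every workload):
   
   ```java
   import java.util.concurrent.TimeUnit;
   
   import org.apache.iceberg.Table;
   import org.apache.iceberg.actions.DeleteOrphanFiles;
   import org.apache.iceberg.spark.actions.SparkActions;
   import org.apache.spark.sql.SparkSession;
   
   public class RemoveOrphans {
     // Hypothetical helper: `spark` and `table` are assumed to exist.
     static void removeOrphans(SparkSession spark, Table table) {
       // Pick a cutoff comfortably longer than your longest-running write job.
       // With 4-hour jobs, a 3-hour cutoff can delete uncommitted data files.
       long olderThanMillis =
           System.currentTimeMillis() - TimeUnit.HOURS.toMillis(24);
   
       DeleteOrphanFiles.Result result = SparkActions.get(spark)
           .deleteOrphanFiles(table)
           .olderThan(olderThanMillis)
           .execute();
   
       // Log what was removed so a run like this one can be audited later.
       result.orphanFileLocations().forEach(System.out::println);
     }
   }
   ```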
   
   Another possibility is that your file system listing doesn't match the table 
listing, but that's rarer; so far we've only seen it with HDFS alternate 
NameNodes. It seems unlikely that would happen with GCS.
   
   If you can share some of the logs from the orphan files run, that would be 
helpful!

