kbendick commented on issue #3447: URL: https://github.com/apache/iceberg/issues/3447#issuecomment-958179727
A few things are possible that I can think of.

First and foremost, you should check out the docs on:
- table maintenance: https://iceberg.apache.org/#maintenance/#table-maintenance
- streaming table maintenance: https://iceberg.apache.org/#spark-structured-streaming/#maintenance-for-streaming-tables

If you're running the expire snapshots operation, keep in mind that there is an option for how many days' worth of snapshots you want to retain. By default, the Spark action to expire snapshots retains 5 days' worth of snapshots, which would explain why no files are being removed. When you run the expire snapshots job, do you see any logs saying things are enabled?

Some links that might help:
- Javadoc for the action itself: https://iceberg.apache.org/#javadoc/0.12.0/org/apache/iceberg/ExpireSnapshots.html
- Docs for the options from the Spark maintenance actions: https://iceberg.apache.org/#spark-procedures/#metadata-management

I'd also make sure that your table doesn't have `gc.enabled = false` as a table property. If that's set to `false` (the default should be `true`), then files won't be removed regardless.

Additionally, it's possible that you have orphan files from commits that did not succeed and needed to be retried. You should also run the `remove orphan files` action: https://iceberg.apache.org/#maintenance/#remove-orphan-files

How are you running the actions? With Java code or with Spark? I don't believe that Flink presently supports all of the maintenance procedures.

Either way, given that it's only 10 hours' worth of data, you'll definitely need to pass in the timestamp you want to expire older than. Try it out, ensuring that you pass in something for the `older_than` field of the Spark procedure (or the equivalent method if using the Java API). Something like `now() - INTERVAL '4 hours'` or `System.currentTimeMillis() - Duration.ofHours(4).toMillis()` or something.
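To make the cutoff suggestion concrete, here is a minimal sketch of computing an "expire older than" timestamp and turning it into a call to the Spark `expire_snapshots` procedure. The class name `ExpireCutoff`, the catalog `my_catalog`, and the table `db.events` are placeholders of mine, not anything from your setup, and the timestamp literal is formatted in UTC on the assumption that your Spark session time zone is UTC.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not part of Iceberg.
final class ExpireCutoff {

    // Epoch-millis cutoff: snapshots older than this become candidates for expiration.
    static long cutoffMillis(long nowMillis, Duration retention) {
        return nowMillis - retention.toMillis();
    }

    // Builds the SQL text for the Spark expire_snapshots procedure.
    // Assumption: the Spark session time zone is UTC, so the literal is rendered in UTC.
    static String expireSnapshotsCall(String catalog, String table, long cutoffMillis) {
        String ts = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochMilli(cutoffMillis));
        return String.format(
                "CALL %s.system.expire_snapshots(table => '%s', older_than => TIMESTAMP '%s')",
                catalog, table, ts);
    }

    public static void main(String[] args) {
        // Keep only the last 4 hours of snapshots.
        long cutoff = cutoffMillis(System.currentTimeMillis(), Duration.ofHours(4));

        // Placeholder catalog and table names; substitute your own.
        System.out.println(expireSnapshotsCall("my_catalog", "db.events", cutoff));

        // With the Java API (needs the Iceberg dependency and a loaded Table),
        // the same cutoff would be passed as:
        //   table.expireSnapshots().expireOlderThan(cutoff).commit();
    }
}
```

Running `main` prints a statement you could paste into `spark.sql(...)` or a SQL shell. If your session time zone is not UTC, adjust the zone used for formatting, or pass the epoch-millis value through the Java API instead.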
Let us know if that solves your problem, particularly passing in a maximum age of files to keep. The default is well over 10 hours in all cases.
