RussellSpitzer edited a comment on pull request #2196: URL: https://github.com/apache/iceberg/pull/2196#issuecomment-772584764
My quick notes on this issue:

```
Previously, when computing the rewrite tasks for RewriteDataFiles, the code
would ignore scan tasks which referred to a single file. This is an issue
because large files could potentially be split into multiple read tasks. If
one slice of a large file was combined with a slice from another file, that
section would be rewritten with the other file, but the other slices would be
ignored.

For example, given 2 files

File A - 100 Bytes
File B - 10 Bytes

If the target split size was 60 bytes we would end up with 3 tasks

A : 1 - 60
A : 61 - 100
B : 0 - 10

Which would be combined into

(A : 1 - 60)
(A : 61 - 100, B : 0 - 10)

The first task would be discarded since it only referred to one file. The
second task would be rewritten, which would end with files A and B both being
deleted, even though the first 60 bytes of A were never written to a new file.

I believe the original intent was to ignore single-file scan tasks, as it was
assumed these would be unchanged files. But if a single-file scan task only
contains a partial scan of a file, it must be rewritten, since it represents a
new, smaller file that still needs to be written out.

Normally this doesn't cause data loss, since an ignored file won't be deleted.
But if a split is combined with another file, that triggers the deletion of
the large file, even though several splits of the large file will not have
been written into new files.
```
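The scenario in the notes can be reproduced with a small, self-contained model. This is only a sketch and not Iceberg's code: `Split`, `plan_splits`, `combine`, `select_old`, `select_fixed`, and `lost_bytes` are hypothetical names, the greedy packing is an assumed simplification of how splits get combined, and byte ranges are zero-based and half-open (so A is cut into `[0, 60)` and `[60, 100)` rather than `1 - 60` and `61 - 100`).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Split:
    """Byte range [start, end) of one data file (hypothetical stand-in for a scan task)."""
    file: str
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


def plan_splits(file_sizes: Dict[str, int], target: int) -> List[Split]:
    """Cut every file into slices of at most `target` bytes."""
    splits = []
    for name, size in file_sizes.items():
        for start in range(0, size, target):
            splits.append(Split(name, start, min(start + target, size)))
    return splits


def combine(splits: List[Split], target: int) -> List[List[Split]]:
    """Greedily pack consecutive splits into combined tasks of at most `target` bytes."""
    tasks, current, used = [], [], 0
    for s in splits:
        if current and used + s.length > target:
            tasks.append(current)
            current, used = [], 0
        current.append(s)
        used += s.length
    if current:
        tasks.append(current)
    return tasks


def select_old(tasks: List[List[Split]]) -> List[List[Split]]:
    """Behavior described in the notes: drop tasks that refer to a single file."""
    return [t for t in tasks if len({s.file for s in t}) > 1]


def select_fixed(tasks: List[List[Split]], file_sizes: Dict[str, int]) -> List[List[Split]]:
    """Assumed fix: also keep single-file tasks that cover only part of their file."""
    def partial(t: List[Split]) -> bool:
        return any(s.length < file_sizes[s.file] for s in t)
    return [t for t in tasks if len({s.file for s in t}) > 1 or partial(t)]


def lost_bytes(rewritten: List[List[Split]], file_sizes: Dict[str, int]) -> Dict[str, int]:
    """Every file touched by a rewritten task is deleted; report bytes never rewritten."""
    written: Dict[str, int] = {}
    for t in rewritten:
        for s in t:
            written[s.file] = written.get(s.file, 0) + s.length
    return {f: file_sizes[f] - n for f, n in written.items() if n < file_sizes[f]}


sizes = {"A": 100, "B": 10}
tasks = combine(plan_splits(sizes, 60), 60)
print(lost_bytes(select_old(tasks), sizes))           # {'A': 60}
print(lost_bytes(select_fixed(tasks, sizes), sizes))  # {}
```

With the old selection, only the combined task `(A[60, 100), B[0, 10))` is rewritten, yet file A is deleted in full, so 60 bytes of A disappear. Keeping partial single-file tasks in the rewrite set accounts for every byte of each deleted file.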
