rdblue commented on issue #1159:
URL: https://github.com/apache/iceberg/issues/1159#issuecomment-664054683


   We do this a bit differently: instead of rewriting everything that matches a 
filter, we configure the rewrite to primarily look for small files. Any file 
already near the target size is ignored, and then we bin-pack the remaining 
files in each partition and rewrite them. That ensures we don't rewrite large 
amounts of data that are already reasonably sized. Files that are too large 
aren't a concern because they can be split at read time; only the small files 
matter.
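   The selection and packing step described above can be sketched roughly like this. This is an illustrative sketch, not Iceberg's actual API: the names (`FileInfo`, `plan_rewrite`), the target size, and the "near target" threshold are all assumptions for the example.

```python
from dataclasses import dataclass
from typing import List

TARGET_SIZE = 512 * 1024 * 1024  # assumed target file size: 512 MB
NEAR_TARGET = 0.75               # assumed: files >= 75% of target are ignored

@dataclass
class FileInfo:
    path: str
    size: int  # bytes

def plan_rewrite(files: List[FileInfo]) -> List[List[FileInfo]]:
    """Bin-pack small files into groups of roughly TARGET_SIZE each.

    Files already near the target size are skipped entirely, so the
    rewrite only touches data that actually needs compaction.
    """
    small = [f for f in files if f.size < NEAR_TARGET * TARGET_SIZE]
    bins: List[List[FileInfo]] = []
    current: List[FileInfo] = []
    current_size = 0
    # First-fit-decreasing: largest small files first fill bins more evenly.
    for f in sorted(small, key=lambda f: f.size, reverse=True):
        if current and current_size + f.size > TARGET_SIZE:
            bins.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size
    if current:
        bins.append(current)
    return bins
```

Each resulting bin would then be rewritten as one output file per partition; files near the target never appear in any bin.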
   
   We also have an option to keep the files in order by file name, for systems 
like Spark that sort the data before writing. It's a hacky way to avoid ruining 
file pruning that takes advantage of clustered/sorted data.
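   The "keep order" option amounts to packing adjacent files in name order rather than reordering them by size, so data that was written sorted stays clustered and min/max-based pruning still works. A minimal sketch, with an assumed target size and illustrative names:

```python
from typing import List, Tuple

TARGET_SIZE = 512 * 1024 * 1024  # assumed target file size: 512 MB

def pack_in_name_order(files: List[Tuple[str, int]]) -> List[List[str]]:
    """Group (path, size_bytes) pairs into rewrite groups.

    Files are packed strictly in file-name order, never reordered by
    size, so each output file covers a contiguous range of the original
    sort order and column min/max stats stay tight for pruning.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in sorted(files):  # lexical order by path
        if current and current_size + size > TARGET_SIZE:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

The trade-off versus size-based packing is that adjacent small files may pack less tightly, but the sort order of the table is preserved across the rewrite.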

