rdblue commented on issue #481: Consider adding sort key in RewriteManifests
URL: https://github.com/apache/incubator-iceberg/issues/481#issuecomment-534217959

Okay, I understand the motivation for sorting now.

I think sorting is still a difficult way to solve this problem because you'd need to keep so many files in memory. Tables that need metadata rewrites usually have lots of data files.

The approach we took is to look at the metadata tables to get the number of entries and the current size of all manifests, and then use that to estimate how many buckets should be grouped together. Then the grouping function returns a key with `bucket_num % group_size`.

That's a much simpler approach than sorting. The drawback is that it relies on the number of files per bucket staying fairly stable, but I think that's reasonable in most cases.
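
A rough sketch of how that estimate and grouping key might look, under stated assumptions. The function names, the target manifest size, and the exact sizing formula are all hypothetical; the comment only says that the entry count and total manifest size feed the estimate and that the key is `bucket_num % group_size`. Note that with a modulo key, `group_size` is also the number of distinct keys produced.

```python
import math


def estimate_group_size(total_manifest_bytes, target_manifest_bytes, num_buckets):
    """Hypothetical helper: choose the modulus for the grouping key.

    Assumption: aim for roughly one output manifest of the target size
    per distinct key, and never use more keys than there are buckets.
    """
    wanted = max(1, math.ceil(total_manifest_bytes / target_manifest_bytes))
    return min(num_buckets, wanted)


def expected_entries_per_group(entry_count, group_size):
    """Average entries per key, assuming files are spread evenly over buckets."""
    return entry_count / group_size


def group_key(bucket_num, group_size):
    """Grouping key from the comment: bucket_num % group_size."""
    return bucket_num % group_size


# Illustrative numbers only: 800 MiB of manifests, 8 MiB target, 256 buckets.
MIB = 1024 * 1024
group_size = estimate_group_size(800 * MIB, 8 * MIB, 256)
keys = {group_key(b, group_size) for b in range(256)}
```

The even-spread assumption in `expected_entries_per_group` is the same one the comment calls out as the drawback: if the number of files per bucket drifts, the estimate has to be recomputed.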
