rdblue commented on issue #481: Consider adding sort key in RewriteManifests
URL: 
https://github.com/apache/incubator-iceberg/issues/481#issuecomment-534217959
 
 
   Okay, I understand the motivation for sorting now. I think sorting is still 
a difficult way to solve this problem because you'd need to keep so many files 
in memory. Tables that need metadata rewrites usually have lots of data files.
   
   The approach we took is to look at the metadata tables to get the number of 
entries and the current size of all manifests, and then use that to estimate 
how many buckets should be grouped together. Then the grouping function returns 
a key with `bucket_num % group_size`.
   
   That's a much simpler approach than sorting. The drawback is that it relies 
on the number of files per bucket staying fairly stable, but I think that's 
reasonable in most cases.
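
A minimal sketch of that grouping idea, for illustration only. This is not the actual job code: the function names, the `target_manifest_bytes` parameter, and the way the estimate is derived from the entry count and total manifest size are all assumptions. The only part taken from the comment is the key expression `bucket_num % group_size`. Because the key is a modulo, it takes `group_size` distinct values, so the sketch treats `group_size` as the number of output groups.

```python
import math


def estimate_group_size(entry_count, total_manifest_bytes,
                        target_manifest_bytes, num_buckets):
    """Estimate group_size from metadata-table totals (hypothetical helper).

    Assumes entries are roughly the same size and are spread evenly
    across buckets, which is the stability assumption noted above.
    """
    if entry_count <= 0 or total_manifest_bytes <= 0:
        return 1
    # Average serialized size of one manifest entry.
    bytes_per_entry = total_manifest_bytes / entry_count
    # How many entries fit in one manifest of the target size.
    entries_per_manifest = max(1.0, target_manifest_bytes / bytes_per_entry)
    # Number of groups needed to hold all entries.
    groups = math.ceil(entry_count / entries_per_manifest)
    # Never fewer than one group, never more groups than buckets.
    return max(1, min(num_buckets, groups))


def group_key(bucket_num, group_size):
    """Grouping key as described in the comment: bucket_num % group_size."""
    return bucket_num % group_size


if __name__ == "__main__":
    size = estimate_group_size(
        entry_count=1000,
        total_manifest_bytes=8000,
        target_manifest_bytes=2000,
        num_buckets=16,
    )
    for bucket in range(16):
        print(bucket, "->", group_key(bucket, size))
```

Each entry is routed by its bucket number alone, so nothing has to be sorted or buffered in memory. The estimate only has to be recomputed when the number of files per bucket drifts.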

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
