aokolnychyi edited a comment on issue #481: Consider adding sort key in RewriteManifests
URL: https://github.com/apache/incubator-iceberg/issues/481#issuecomment-533691669

Let’s consider a specific example: we have a table partitioned by col1 and further bucketed into 20 buckets by col2. On the one hand, clustering by both columns is too granular and yields small manifests. On the other hand, clustering only by col1 would mean producing, let’s say, 5 manifests of the needed size for each top-level partition. The problem is that each of those 5 manifests can contain files for the same leaf partition. As a result, we will be scanning all the metadata for the top-level partition even though we query for a specific leaf partition.

To address those issues, I had multiple approaches in mind.

Option 1
- Estimate the size of one manifest entry by iterating through the metadata for the manifests we are rewriting.
- Read manifests and cluster the files.
- Compute the number of files per group and estimate the size of the metadata.
- Use bin packing on the info from the step above, assuming the clustering key is sortable. That way, we will group metadata for leaf partitions together.
- Finally, group files based on bins.

Option 2
- Introduce a sort key in addition to the clustering key. In the example above, we can say: cluster data by col1 and sort by col2.
- Read manifests and cluster files.
- Sort files within groups.
- Write manifests respecting the target manifest size.

In our case, we will still produce 5 manifests, but only one of them will contain data for a given leaf partition (unless we are unlucky and close a manifest in the middle of a leaf partition, which we can prevent).

To sum up, a sort key would only help if we also respect the target manifest size.

I considered these approaches from a distributed job perspective, and it seemed we would need to scan the metadata multiple times with the first option. We can cache, of course.
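To make Option 2 concrete, here is a rough single-process sketch in Python. It is not Iceberg code: `plan_manifests`, the dict-based file records, and measuring the target manifest size in entries rather than bytes are all assumptions made for illustration. It clusters by col1, sorts by the col2 bucket, and closes a manifest only at a leaf partition boundary.

```python
from collections import defaultdict
from itertools import groupby


def plan_manifests(files, cluster_key, sort_key, target_entries):
    """Hypothetical helper: cluster files, sort each cluster, then cut
    manifests of roughly target_entries entries at sort-key boundaries."""
    clusters = defaultdict(list)
    for f in files:
        clusters[cluster_key(f)].append(f)

    manifests = []
    for key in sorted(clusters):
        group = sorted(clusters[key], key=sort_key)
        current = []
        # Each run of equal sort key corresponds to one leaf partition.
        for _, run in groupby(group, key=sort_key):
            run = list(run)
            # Close the manifest before a run that would overflow it, so a
            # leaf partition is never split. A single run larger than the
            # target still ends up in one oversized manifest.
            if current and len(current) + len(run) > target_entries:
                manifests.append(current)
                current = []
            current.extend(run)
        if current:
            manifests.append(current)
    return manifests


# One top-level partition (col1 = "a"), 20 buckets, 10 files per bucket.
files = [{"col1": "a", "bucket": i % 20, "path": f"f{i}"} for i in range(200)]
manifests = plan_manifests(
    files,
    cluster_key=lambda f: f["col1"],
    sort_key=lambda f: f["bucket"],
    target_entries=40,
)
print(len(manifests))
```

With these made-up numbers, each manifest holds four whole buckets, so a query for one bucket has to open a single manifest for that top-level partition.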
Having entries for the same partition close to each other might be CPU-friendly.

@bryanck @rdblue is my first option similar to what you suggest?
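For reference, this is roughly how I picture the first option, again as a plain Python sketch rather than Iceberg code. The helper names, the `(length_bytes, entry_count)` manifest stats, and the tuple-based file records are assumptions for illustration. The bin packing here is order-preserving over the sorted leaf keys, so neighbouring leaf partitions land in the same bin.

```python
from collections import Counter


def estimate_entry_size(manifest_stats):
    """Average bytes per entry, from hypothetical (length_bytes, entry_count) pairs."""
    total_bytes = sum(length for length, _ in manifest_stats)
    total_entries = sum(count for _, count in manifest_stats)
    return max(1, total_bytes // max(1, total_entries))


def pack_groups(files, leaf_key, entry_size, target_bytes):
    """Pack leaf groups, in sorted key order, into bins of about target_bytes."""
    counts = Counter(leaf_key(f) for f in files)
    bins, current, current_size = [], [], 0
    for key in sorted(counts):
        size = counts[key] * entry_size  # estimated metadata size of the group
        if current and current_size + size > target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        bins.append(current)
    return bins


def assign_files(files, leaf_key, bins):
    """Group files by the bin that owns their leaf key."""
    bin_of = {key: i for i, keys in enumerate(bins) for key in keys}
    grouped = [[] for _ in bins]
    for f in files:
        grouped[bin_of[leaf_key(f)]].append(f)
    return grouped


# Two top-level partitions, 20 buckets each, 10 files per bucket.
files = [(col1, bucket) for col1 in ("a", "b") for bucket in range(20) for _ in range(10)]
entry_size = estimate_entry_size([(20_000, 200), (20_000, 200)])
bins = pack_groups(files, lambda f: f, entry_size, target_bytes=4_000)
grouped = assign_files(files, lambda f: f, bins)
print(len(bins), [len(g) for g in grouped])
```

Note the two passes over the file list (one to count, one to assign), which is the repeated metadata scan mentioned above.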
