aokolnychyi edited a comment on issue #481: Consider adding sort key in 
RewriteManifests
URL: 
https://github.com/apache/incubator-iceberg/issues/481#issuecomment-533691669
 
 
   Let’s consider a specific example: we have a table partitioned by col1 and 
further bucketed into 20 buckets by col2. On the one hand, clustering by both 
columns is too granular and yields small manifests. On the other hand, 
clustering only by col1 would mean producing, let’s say, 5 manifests of the 
needed size for each top-level partition. The problem is that each of those 5 
manifests can contain files for the same leaf partition. As a result, we would 
scan all the metadata for the top-level partition even when we query for a 
specific leaf partition.
   
   To address these issues, I had two approaches in mind.
   
   Option 1
   
   - Estimate the size of one manifest entry by iterating through the metadata 
for the manifests we are rewriting.
   - Read the manifests and cluster the files.
   - Compute the number of files per group and, using the entry size from the 
first step, estimate the size of the metadata for each group.
   - Apply bin packing to those estimates, assuming the clustering key is 
sortable. That way, we will group metadata for leaf partitions together.
   - Finally, group files based on the bins.
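
   A minimal sketch of the steps above, in Python and for illustration only. 
It assumes a leaf partition is identified by a hypothetical (col1, bucket) 
tuple, takes the per-entry size estimate as an input, and swaps in a simple 
in-order (next-fit) packing over the sorted keys rather than a general bin 
packing algorithm. The names `pack_groups`, `entry_size`, and `target_size` are 
made up here and are not part of Iceberg.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical leaf partition key: (col1 value, col2 bucket).
Key = Tuple[str, int]


def pack_groups(files: List[Key], entry_size: int, target_size: int) -> List[List[Key]]:
    """Pack leaf partitions, in sorted key order, into bins of roughly target_size bytes.

    files       -- one key per data file (the leaf partition the file belongs to)
    entry_size  -- estimated size in bytes of one manifest entry (assumed given)
    target_size -- target manifest size in bytes
    """
    counts = Counter(files)  # number of files per leaf partition
    bins: List[List[Key]] = []
    current: List[Key] = []
    current_size = 0
    for key in sorted(counts):  # relies on the clustering key being sortable
        group_size = counts[key] * entry_size  # estimated metadata size of the group
        # Close the current bin if adding this group would exceed the target.
        if current and current_size + group_size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += group_size
    if current:
        bins.append(current)
    return bins
```

   Each returned bin lists the leaf partitions whose files would go into one 
manifest. A group larger than the target still gets a bin of its own in this 
sketch; splitting such a group is left out.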
   
   Option 2
   
   - Introduce a sort key in addition to the clustering key. In the example 
above, we can cluster data by col1 and sort by col2.
   - Read the manifests and cluster the files.
   - Sort the files within each group.
   - Write manifests respecting the target manifest size. In our case, we will 
still produce 5 manifests, but the files for a given leaf partition will end up 
in only one of them (unless we are unlucky and close a manifest in the middle 
of a leaf partition, which we can prevent).
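
   A rough sketch of the steps above, again in Python and only as an 
illustration. The `DataFile` tuple, `write_manifests`, and `target_entries` are 
hypothetical names, the target manifest size is expressed as a number of 
entries for simplicity, and a manifest is only closed on a leaf partition 
boundary, which is one possible way to avoid splitting a leaf partition.

```python
from itertools import groupby
from typing import List, NamedTuple


class DataFile(NamedTuple):
    """Hypothetical stand-in for a manifest entry."""
    col1: str    # clustering key (top-level partition)
    bucket: int  # sort key (col2 bucket, i.e. the leaf partition)
    path: str


def write_manifests(files: List[DataFile], target_entries: int) -> List[List[DataFile]]:
    """Cluster by col1, sort by bucket, and cut manifests at leaf partition boundaries."""
    manifests: List[List[DataFile]] = []
    # Sorting by (col1, bucket) clusters the files and sorts them within each cluster.
    ordered = sorted(files, key=lambda f: (f.col1, f.bucket))
    for _, cluster in groupby(ordered, key=lambda f: f.col1):
        current: List[DataFile] = []
        for _, leaf in groupby(cluster, key=lambda f: f.bucket):
            leaf_files = list(leaf)
            # Close the manifest before a leaf partition that would overflow it,
            # never in the middle of one.
            if current and len(current) + len(leaf_files) > target_entries:
                manifests.append(current)
                current = []
            current.extend(leaf_files)
        if current:
            manifests.append(current)  # manifests never span two col1 values
    return manifests
```

   With one col1 value, 20 buckets of 5 files each, and a target of 20 entries 
per manifest, this produces 5 manifests, and every bucket lands in exactly one 
of them.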
   
   To sum up, a sort key would only help if we also respect the target 
manifest size.
   
   I considered these approaches from a distributed job perspective, and it 
seemed that the first option would require scanning the metadata multiple 
times. We can cache it, of course. Having entries for the same partition close 
to each other might also be CPU-friendly.
   
   @bryanck @rdblue is my first option similar to what you suggest?

