Re: [PR] [HUDI-8208] Fix partition stats bound when compacting or clustering [hudi]

via GitHub Tue, 08 Oct 2024 23:05:31 -0700


codope commented on PR #12050:
URL: https://github.com/apache/hudi/pull/12050#issuecomment-2401389180


   > > To calculate tight bound, we look at the colstats partition for the 
uncompacted or unclustered files and then merge the colstats with that of the 
compacted or clustered files.
   > 
   > Are you saying instead of using the native min_max range for columns in 
files generated from compaction and clustering, we recompute the column stats 
ranges from the source files? For example if we have f1 with range [v1, v2] and 
f2 with range [v3, v4], instead of using [v1, v4] as the compaction file range, 
we still use the composition of [v1, v2] and [v3, v4] ?
   
   For the files generated by compaction and clustering, we were already 
using the native min/max range. However, the partition stats update ignored 
the file slices that were not compacted or clustered. If all the file slices 
in a partition happened to be compacted or clustered, the partition stats 
would have a tight bound even without this patch.
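
As a rough illustration of the merge described above (a toy sketch, not Hudi code: `ColumnRange`, `merge_ranges`, and `partition_range` are hypothetical names, and integer column values are assumed), the partition-level range is the union of the native ranges of the rewritten files and the ranges of the files that were left untouched:

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ColumnRange:
    """Hypothetical min/max stats for one column in one file."""
    min_value: int
    max_value: int


def merge_ranges(ranges: Iterable[ColumnRange]) -> ColumnRange:
    """Merge per-file ranges into the smallest range that covers all of them."""
    ranges = list(ranges)
    if not ranges:
        raise ValueError("at least one range is required")
    return ColumnRange(
        min(r.min_value for r in ranges),
        max(r.max_value for r in ranges),
    )


def partition_range(rewritten: Iterable[ColumnRange],
                    untouched: Iterable[ColumnRange]) -> ColumnRange:
    """Partition-level range: native ranges of the compacted or clustered
    files merged with the ranges of the files that were not rewritten."""
    return merge_ranges([*rewritten, *untouched])


# f1 and f2 are compacted into one file; f3 is not part of the compaction.
f1, f2, f3 = ColumnRange(10, 20), ColumnRange(30, 40), ColumnRange(1, 5)
compacted = merge_ranges([f1, f2])
print(partition_range(rewritten=[compacted], untouched=[f3]))
```

When `untouched` is empty, the result is just the merged range of the rewritten files, which matches the case where every file slice in the partition was compacted or clustered.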


