Hello all,

I noticed in a recent run that there was some skew in the bloom filter checking stage. Even though sort-based partitioning distributes the records uniformly among partitions, the cost of each partition is driven by the number of file groups being checked in it.
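To make the skew concrete, here is a toy sketch with made-up numbers (not taken from the actual run). It assumes each record carries a count of candidate file groups to check and that a partition's cost is simply the sum of those counts; `split_evenly` and `partition_costs` are hypothetical helpers, not Hudi code.

```python
def split_evenly(sorted_records, num_partitions):
    """Split a sorted list into contiguous chunks of near-equal record count."""
    size, extra = divmod(len(sorted_records), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(sorted_records[start:end])
        start = end
    return chunks


def partition_costs(chunks):
    """Assumed cost model: sum of candidate file group checks per partition."""
    return [sum(checks for _, checks in chunk) for chunk in chunks]


# Hypothetical workload: (record_key, number of candidate file groups to check).
# The first quarter of the key range overlaps many more file groups.
records = sorted((f"key-{i:03d}", 20 if i < 25 else 2) for i in range(100))

chunks = split_evenly(records, 4)
sizes = [len(chunk) for chunk in chunks]   # record counts are uniform
costs = partition_costs(chunks)            # check counts are not

print(sizes)
print(costs)
```

Every partition gets the same number of records, but under this cost model the first one does ten times the work of the others.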
I chose to prototype a file group based custom partitioner, with the intention of distributing this work more evenly. I am seeing consistently good results. For example, the duration distribution

```
Metric    Min   25th percentile  Median  75th percentile  Max
Duration  2 s   14 s             48 s    1.6 min          3.9 min
```

becomes

```
Metric    Min   25th percentile  Median  75th percentile  Max
Duration  21 s  40 s             44 s    49 s             1.9 min
```

I can make this a configuration option, so it is probably worth getting it in and iterating? If y'all think so, I will prep a PR. HUDI-108 tracks this.

Thanks,
Vinoth
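P.S. A rough sketch of the general idea, for discussion only. This is not the prototype code: it assumes we already have an estimate of the number of checks per file group, and it uses a plain greedy "largest first, least loaded partition" packing, which may differ from what the prototype does. All names (`assign_file_groups`, `partition_loads`, the `fg-*` ids) and numbers are made up.

```python
import heapq


def assign_file_groups(checks_per_file_group, num_partitions):
    """Greedily map each file group to the currently least loaded partition."""
    # Min-heap of (current load, partition id).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    # Place the heaviest file groups first; tie-break on id for determinism.
    ordered = sorted(checks_per_file_group.items(), key=lambda kv: (-kv[1], kv[0]))
    for file_group, checks in ordered:
        load, partition = heapq.heappop(heap)
        assignment[file_group] = partition
        heapq.heappush(heap, (load + checks, partition))
    return assignment


def partition_loads(checks_per_file_group, assignment, num_partitions):
    """Total estimated checks landing on each partition."""
    loads = [0] * num_partitions
    for file_group, partition in assignment.items():
        loads[partition] += checks_per_file_group[file_group]
    return loads


# Hypothetical estimate of bloom filter checks per file group.
checks = {"fg-a": 500, "fg-b": 300, "fg-c": 200, "fg-d": 200, "fg-e": 100, "fg-f": 100}

assignment = assign_file_groups(checks, 3)
loads = partition_loads(checks, assignment, 3)
print(loads)
```

Note that in this simple form a single very hot file group still lands on one partition, so it only evens things out when no one file group dominates.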