This does look very promising. It makes sense to enable this mode through configuration.

Balaji.V

On Friday, May 3, 2019, 8:18:44 AM PDT, Vinoth Chandar <vin...@apache.org> wrote:

Hello all,
I noticed in a recent run that there was some skew in the bloom filter checking stage. Even though sort-based partitioning distributes the records uniformly among partitions, the cost is driven by the number of file groups being checked in each partition. I chose to prototype a file-group-based custom partitioner, with the intention of distributing this work more evenly.

I am seeing consistently good results. For example,

```
Metric    Min   25th percentile  Median  75th percentile  Max
Duration  2 s   14 s             48 s    1.6 min          3.9 min
```

becomes

```
Metric    Min   25th percentile  Median  75th percentile  Max
Duration  21 s  40 s             44 s    49 s             1.9 min
```

I can just make this a configuration option, so it is probably worth getting it in and iterating? If y'all think so, I will prep a PR. HUDI-108 tracks this.

Thanks
Vinoth
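
For anyone following the thread, here is a rough sketch of what balancing by file group could look like. This is not the prototype's code: the thread does not say how the partitioner assigns file groups, so the greedy heaviest-first strategy, the function names `assign_file_groups` and `partition_loads`, and the per-file-group comparison counts are all assumptions made for illustration.

```python
import heapq


def assign_file_groups(comparisons_per_file_group, num_partitions):
    """Map each file group to a partition, balancing total comparisons.

    Hypothetical helper: places the heaviest file group first, always onto
    the partition with the smallest load so far (greedy bin packing).
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")

    # Min-heap of (current_load, partition_id); the lightest partition is on top.
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)

    assignment = {}
    # Sort by descending cost; ties broken by file group id for determinism.
    ordered = sorted(comparisons_per_file_group.items(),
                     key=lambda kv: (-kv[1], kv[0]))
    for file_group, cost in ordered:
        load, partition = heapq.heappop(heap)
        assignment[file_group] = partition
        heapq.heappush(heap, (load + cost, partition))
    return assignment


def partition_loads(comparisons_per_file_group, assignment, num_partitions):
    """Total comparisons that land on each partition under an assignment."""
    loads = [0] * num_partitions
    for file_group, partition in assignment.items():
        loads[partition] += comparisons_per_file_group[file_group]
    return loads


if __name__ == "__main__":
    # Made-up counts of record keys to check against each file group.
    comparisons = {"fg-a": 90, "fg-b": 50, "fg-c": 40, "fg-d": 10, "fg-e": 10}
    assignment = assign_file_groups(comparisons, num_partitions=2)
    print(assignment)
    print(partition_loads(comparisons, assignment, num_partitions=2))
```

The point of the sketch is only that the unit of balancing is the work per file group rather than the record count per partition, which is the kind of change that would narrow the spread between the fastest and slowest tasks.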