Hello all,

I noticed in a recent run that the bloom filter checking stage was
skewed. Even though sort-based partitioning distributes the records
uniformly among partitions, the cost is driven by the number of file
groups being checked in each partition.

I chose to prototype a file-group-based custom partitioner, with the
intention of distributing this work more evenly. I am seeing consistently
good results.

For example:
```
Metric    Min   25th percentile  Median  75th percentile  Max
Duration  2 s   14 s             48 s    1.6 min          3.9 min
```
becomes
```
Metric    Min   25th percentile  Median  75th percentile  Max
Duration  21 s  40 s             44 s    49 s             1.9 min
```
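
To make the idea concrete, here is a minimal sketch of file-group-based
partitioning. It is not the HUDI-108 prototype: the function name
`assign_file_groups`, the input shape, and the cost model (one unit of
work per key checked against a file group) are all assumptions made for
illustration. Each file group is pinned to a single partition, and the
groups are placed largest first onto the least loaded partition.

```python
import heapq
from collections import Counter


def assign_file_groups(records, num_partitions):
    """Map each file group to one partition, balancing estimated work.

    `records` is an iterable of (file_group_id, record_key) pairs, i.e. the
    key lookups that have to be run against each file group's bloom filter.
    The cost of a file group is assumed to be its number of lookups
    (a hypothetical cost model, not taken from Hudi).
    """
    # Estimated work per file group: how many keys must be checked against it.
    load_per_group = Counter(file_group for file_group, _ in records)

    # Min-heap of (current_load, partition_index), so the least loaded
    # partition is always popped first.
    heap = [(0, partition) for partition in range(num_partitions)]
    heapq.heapify(heap)

    assignment = {}
    # Greedy "largest first" placement; ties broken by file group id so the
    # result is deterministic.
    ordered = sorted(load_per_group.items(), key=lambda kv: (-kv[1], kv[0]))
    for file_group, load in ordered:
        current_load, partition = heapq.heappop(heap)
        assignment[file_group] = partition
        heapq.heappush(heap, (current_load + load, partition))
    return assignment
```

With the resulting mapping, every lookup for a given file group lands in
the same partition, so no partition ends up touching a disproportionate
number of file groups.
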
I can just make this a configuration option, so it is probably worth
getting it in and iterating?
If y'all think so, I will prep a PR. HUDI-108 tracks this.

Thanks
Vinoth
