This does look very promising. It makes sense to enable this mode through 
configuration.
Balaji.V

On Friday, May 3, 2019, 8:18:44 AM PDT, Vinoth Chandar <vin...@apache.org> wrote:
 
 Hello all,

I noticed in a recent run that there was some skew in the bloom filter
checking stage. Even though sort based partitioning distributes the records
uniformly among partitions, the cost is driven by the number of file groups
being checked in one partition.

I chose to prototype a file group based custom partitioner, with the
intention of distributing this work more evenly. I am seeing consistently
good results.

For example:
```
Metric    Min   25th percentile  Median  75th percentile  Max
Duration  2 s   14 s             48 s    1.6 min          3.9 min
```
becomes
```
Metric    Min   25th percentile  Median  75th percentile  Max
Duration  21 s  40 s             44 s    49 s             1.9 min
```
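
To illustrate the idea (this is only a rough sketch, not the prototype
itself), one way to spread file group checks evenly is to weigh each file
group by the number of key lookups it will receive and greedily place the
heaviest file groups on the least loaded partition. The function names, the
file group ids, and the per file group comparison counts below are all
hypothetical and chosen for illustration.

```python
import heapq


def assign_file_groups(comparisons_per_file_group, num_partitions):
    """Map each file group id to a partition index.

    Greedy heuristic: take file groups from heaviest to lightest and put
    each one on the partition with the smallest load so far.
    The weights are an assumed estimate of lookups per file group.
    """
    # Min-heap of (current load, partition index).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)

    assignment = {}
    ordered = sorted(
        comparisons_per_file_group.items(),
        key=lambda kv: (-kv[1], kv[0]),  # heaviest first, id breaks ties
    )
    for file_group, load in ordered:
        total, partition = heapq.heappop(heap)
        assignment[file_group] = partition
        heapq.heappush(heap, (total + load, partition))
    return assignment


def partition_loads(comparisons_per_file_group, assignment, num_partitions):
    """Sum the assumed comparison counts that land on each partition."""
    loads = [0] * num_partitions
    for file_group, partition in assignment.items():
        loads[partition] += comparisons_per_file_group[file_group]
    return loads


if __name__ == "__main__":
    # Hypothetical comparison counts per file group.
    weights = {"fg-a": 100, "fg-b": 60, "fg-c": 50, "fg-d": 40, "fg-e": 10}
    plan = assign_file_groups(weights, num_partitions=2)
    print(plan)
    print(partition_loads(weights, plan, num_partitions=2))
```

Since all lookups for a given file group land on the same partition, each
task touches a bounded set of file groups, which is the property that seems
to matter for the stage duration.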
I can just make this a configuration option, so it is probably worth getting
it in and iterating?
If y'all think so, I will prep a PR. HUDI-108 tracks this.

Thanks
Vinoth
  
