Nice, we needed to take a fresh look at the indexing stages; great start!
The results look promising. The Min and 25th percentile went up, but
that's expected since the cost moved from the most expensive tasks to the
others. Eventually, this should bring down both the skew and the stage time.
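
As a rough check on the skew, here is a quick max/median comparison using
the durations from the two tables quoted below, converted to seconds. The
"max divided by median" ratio is just my assumption for a simple skew
measure, not something the Spark UI reports. It works out to roughly 4.9x
before and 2.6x after.

```python
# Task durations in seconds, copied from the two tables in the quoted
# message (1.6 min = 96 s, 3.9 min = 234 s, 1.9 min = 114 s).
before = {"min": 2, "p25": 14, "median": 48, "p75": 96, "max": 234}
after = {"min": 21, "p25": 40, "median": 44, "p75": 49, "max": 114}


def skew_ratio(stats):
    """Slowest task divided by the median task; 1.0 means no straggler."""
    return stats["max"] / stats["median"]


print(f"before: {skew_ratio(before):.2f}x")
print(f"after:  {skew_ratio(after):.2f}x")
```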

BTW, when there are only a few files and sort partitioning may work
better, will we see higher stage times? I'm guessing the increase in those
depends on the number of files relative to the number of records? Is there
scope to parallelize within a partition if opening multiple file handles is
OK? We could do this for jobs that are running with more than 1 core?

-Nishith

On Fri, May 3, 2019 at 9:00 AM vbal...@apache.org <vbal...@apache.org>
wrote:

>
> This does look very promising. It makes sense to enable this mode through
> configuration.
> Balaji.V
>
> On Friday, May 3, 2019, 8:18:44 AM PDT, Vinoth Chandar <
> vin...@apache.org> wrote:
>
>  Hello all,
>
> Noticed in a recent run that there was some skew in the bloom filter
> checking stage. Even though sort based partitioning uniformly
> distributes the records among partitions, the cost is controlled by the
> number of file groups being checked in one partition.
>
> I chose to prototype a file group based custom partitioner, with the
> intention of distributing this more evenly. I am seeing consistently good
> results.
>
> For example:
> ```
> Metric    Min   25th percentile  Median  75th percentile  Max
> Duration  2 s   14 s             48 s    1.6 min          3.9 min
> ```
> becomes
> ```
> Metric    Min   25th percentile  Median  75th percentile  Max
> Duration  21 s  40 s             44 s    49 s             1.9 min
> ```
> I can make this configurable. So it's probably worth getting it in and
> iterating?
> If y'all think so, I will prep a PR. HUDI-108 tracks this.
>
> Thanks
> Vinoth
>
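
To make the file group based partitioning idea above concrete, here is a
minimal sketch in plain Python. It is not the prototype from HUDI-108; the
greedy "largest load first" assignment, the function names, and the
per-file-group load numbers are all assumptions for illustration. The idea
is only that partitions are balanced by the estimated checking cost per file
group rather than by record count alone.

```python
import heapq


def assign_file_groups(load_per_file_group, num_partitions):
    """Greedily map each file group to the currently lightest partition.

    load_per_file_group: hypothetical dict of file group id -> estimated
    cost of checking that file group (e.g. number of candidate keys).
    Returns a dict of file group id -> partition index.
    """
    # Min-heap of (total load so far, partition index).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    # Place the heaviest file groups first; ties are broken by id so the
    # result is deterministic.
    ordered = sorted(load_per_file_group.items(), key=lambda kv: (-kv[1], kv[0]))
    for file_group, load in ordered:
        total, partition = heapq.heappop(heap)
        assignment[file_group] = partition
        heapq.heappush(heap, (total + load, partition))
    return assignment


def partition_loads(load_per_file_group, assignment, num_partitions):
    """Sum the estimated load that lands on each partition."""
    loads = [0] * num_partitions
    for file_group, partition in assignment.items():
        loads[partition] += load_per_file_group[file_group]
    return loads


# Made-up loads for six file groups, spread over three partitions.
example_loads = {"fg-a": 90, "fg-b": 50, "fg-c": 40, "fg-d": 30, "fg-e": 20, "fg-f": 10}
example_assignment = assign_file_groups(example_loads, 3)
print(partition_loads(example_loads, example_assignment, 3))
```

A single very heavy file group still sets a floor on the slowest task,
which is where the question about parallelizing within a partition would
come in.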
