>> BTW, in case of fewer number of files where sort partitioning may
>> work better, will we see higher stage times ?
I think this will always perform equally or faster, since the only trade-off
is that more file handles get opened in parallel.
Sort-based was being nicer to our HDFS NameNode :)
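
To make the trade-off concrete, here is a rough Python sketch, not the prototype itself. It assumes a toy cost model in which a task pays once for every distinct file group it has to open, and it assumes the file-group-based partitioner cuts each file group into fixed-size buckets and deals them out round-robin. The function names and `keys_per_bucket` are made up for illustration.

```python
from collections import Counter


def sort_based_partitions(keys_per_file_group, num_partitions):
    """Sort (file group, key) pairs and cut them into equal-sized ranges."""
    pairs = [fg for fg, n in sorted(keys_per_file_group.items()) for _ in range(n)]
    size = -(-len(pairs) // num_partitions)  # ceiling division
    # Each Counter maps file group -> number of keys checked in that partition.
    return [Counter(pairs[i:i + size]) for i in range(0, len(pairs), size)]


def file_group_based_partitions(keys_per_file_group, num_partitions, keys_per_bucket):
    """Split every file group into buckets and assign buckets round-robin."""
    partitions = [Counter() for _ in range(num_partitions)]
    next_partition = 0
    for fg, remaining in sorted(keys_per_file_group.items()):
        while remaining > 0:
            take = min(keys_per_bucket, remaining)
            partitions[next_partition][fg] += take
            next_partition = (next_partition + 1) % num_partitions
            remaining -= take
    return partitions


def file_groups_per_partition(partitions):
    """Distinct file groups each task has to open (the assumed cost driver)."""
    return [len(p) for p in partitions]


# Hypothetical workload: one large file group plus 60 single-key file groups.
workload = {"fg-000": 60}
workload.update({f"fg-{i:03d}": 1 for i in range(1, 61)})

print(file_groups_per_partition(sort_based_partitions(workload, 4)))
# [1, 1, 30, 30]
print(file_groups_per_partition(file_group_based_partitions(workload, 4, 10)))
# [16, 16, 16, 16]
```

In this toy workload the sort-based split leaves two tasks checking 30 file groups each, while the bucketed split checks 16 per task. The price is that the large file group is now opened by four tasks instead of two, which is the extra file-handle parallelism mentioned above.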

>> We can do this for jobs that are running > 1 cores ?
This should have no bearing on this setting. We can discuss more on the PR
with adequate context.

On Fri, May 3, 2019 at 9:21 AM nishith agarwal <n3.nas...@gmail.com> wrote:

> Nice, we needed to take a fresher look at the indexing stages, great start!
> The results look promising. Looks like the Min and 25th percentile went up, but
> that's expected since the cost moved from the most expensive tasks to the other
> ones. Eventually, this will bring down the skew and stage time.
>
> BTW, in case of fewer number of files where sort partitioning may work
> better, will we see higher stage times ? I'm guessing the increase in those
> depends on the number of files vs number of records ? Is there scope to
> parallelize per partition if opening multiple handles is ok ? We can do
> this for jobs that are running > 1 cores ?
>
> -Nishith
>
> On Fri, May 3, 2019 at 9:00 AM vbal...@apache.org <vbal...@apache.org>
> wrote:
>
> >
> > This does look very promising. It makes sense to enable this mode through
> > configuration.
> > Balaji.V
> >
> > On Friday, May 3, 2019, 8:18:44 AM PDT, Vinoth Chandar <
> > vin...@apache.org> wrote:
> >
> >  Hello all,
> >
> > Noticed in a recent run that there were some skews on the bloom filter
> > checking stage. I noticed that even though sort based partitioning
> > uniformly distributes the records among partitions, the cost is controlled
> > by the number of file groups being checked in one partition.
> >
> > I chose to prototype a file group based custom partitioner, with the
> > intention of distributing this more evenly. I am seeing consistently good
> > results.
> >
> > For example,
> > ```
> > Metric    Min   25th percentile  Median  75th percentile  Max
> > Duration  2 s   14 s             48 s    1.6 min          3.9 min
> > ```
> > becomes
> > ```
> > Metric    Min   25th percentile  Median  75th percentile  Max
> > Duration  21 s  40 s             44 s    49 s             1.9 min
> > ```
> > I can just make this a configuration option. So it's probably worth getting
> > it in and iterating?
> > If y'all think so, I will prep a PR. HUDI-108 tracks this.
> >
> > Thanks
> > Vinoth
> >
>
