Nice, we needed to take a fresh look at the indexing stages, great start! The results look promising. Looks like the Min and 25th percentile went up, but that's expected since the cost moved from the most expensive tasks to the others. Eventually, this will bring down the skew and the stage time.
BTW, with a smaller number of files, where sort partitioning may work better, will we see higher stage times? I'm guessing the increase depends on the number of files vs. the number of records? Is there scope to parallelize within a partition if opening multiple handles is OK? We could do this for jobs running with more than 1 core?

-Nishith

On Fri, May 3, 2019 at 9:00 AM vbal...@apache.org <vbal...@apache.org> wrote:

> This does look very promising. It makes sense to enable this mode through
> configuration.
>
> Balaji.V
>
> On Friday, May 3, 2019, 8:18:44 AM PDT, Vinoth Chandar
> <vin...@apache.org> wrote:
>
> Hello all,
>
> Noticed in a recent run that there were some skews in the bloom filter
> checking stage. I noticed that even though sort-based partitioning
> uniformly distributes the records among partitions, the cost is controlled
> by the number of file groups being checked in one partition.
>
> I chose to prototype a file-group-based custom partitioner, with the
> intention of distributing this more evenly. I am seeing consistently good
> results.
>
> For example:
> ```
> Metric    Min   25th percentile  Median  75th percentile  Max
> Duration  2 s   14 s             48 s    1.6 min          3.9 min
> ```
> becomes
> ```
> Metric    Min   25th percentile  Median  75th percentile  Max
> Duration  21 s  40 s             44 s    49 s             1.9 min
> ```
> I can just make this a configuration per se. So probably worth getting it
> in and iterating?
> If y'all think so, will prep a PR. HUDI-108 tracks this.
>
> Thanks
> Vinoth
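
To make the skew argument in the thread concrete, here is a rough toy sketch, not the HUDI-108 prototype. It assumes a partition's cost is simply the number of distinct file groups it has to open, and compares a sort-based split with a greedy file-group-based one. All function names and the input data are hypothetical.

```python
from collections import defaultdict
import heapq


def sort_based_partitions(pairs, num_partitions):
    """Sort (file_group, key) pairs and cut them into equal-sized ranges."""
    ordered = sorted(pairs)
    size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def file_group_based_partitions(pairs, num_partitions):
    """Keep each file group's checks together, handing the largest groups
    first to the partition that currently holds the fewest file groups."""
    by_group = defaultdict(list)
    for file_group, key in pairs:
        by_group[file_group].append((file_group, key))

    partitions = [[] for _ in range(num_partitions)]
    # Heap entries: (file groups assigned, records assigned, partition index).
    heap = [(0, 0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    for file_group in sorted(by_group, key=lambda g: -len(by_group[g])):
        groups, records, idx = heapq.heappop(heap)
        partitions[idx].extend(by_group[file_group])
        heapq.heappush(heap, (groups + 1, records + len(by_group[file_group]), idx))
    return partitions


def file_groups_per_partition(partitions):
    """Assumed cost model: distinct file groups opened by each partition."""
    return [len({file_group for file_group, _ in part}) for part in partitions]


# Hypothetical input: one hot file group plus 30 file groups with one key each.
pairs = [("big", f"key-{i:03d}") for i in range(90)]
pairs += [(f"small-{j:02d}", "key") for j in range(30)]

sort_based = sort_based_partitions(pairs, 4)
group_based = file_group_based_partitions(pairs, 4)
print(file_groups_per_partition(sort_based))
print(file_groups_per_partition(group_based))
```

On this input the sort-based split gives every partition 30 records but file-group counts of [1, 1, 1, 30], while the file-group-based split gives [7, 8, 8, 8]. The cheapest partitions get more expensive and the most expensive one gets much cheaper, which is the same shape as the Min and Max movement in the numbers above. The sketch ignores per-record cost: partition 0 ends up with 96 of the 120 records, so a real partitioner would likely need to weigh both.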