Re: Hash Agg vs Streaming Agg for a smaller data set

rahul challapalli Fri, 10 Jul 2015 17:08:29 -0700

That could be the reason, in the first query we are scanning 64000 records
and in the second case just 108 records. Thanks for the replies!


On Fri, Jul 10, 2015 at 4:48 PM, Jinfeng Ni <[email protected]> wrote:

> I'm not clear which column is the partitioning column. From what you
> described, row count of aggregator in the first case is larger than that in
> the second case, since the former one requires full table scan. Cost-wise,
> hash-agg would make more sense when the input is larger, since
> streaming-agg requires sort, which could be expensive for large dataset.
>
> My guess is the difference of rowcounts in the two cases cause the
> difference in the query plan.
>
> One suggestion. If you want to check query plan, it would make more sense
> to try with reasonably large data.  Drill's costing model is not fully
> calibrated yet;  a small dataset like tpch_0.0.1 might make it hard for the
> cost model to pick the right plan. On the other hand, if the dataset is
> small, two different plans normally would not make a big difference in
> terms of performance. In other words, try to use large dataset if you are
> interested in performance testing / plan verification.
>
>
>
>
>
> On Fri, Jul 10, 2015 at 4:27 PM, rahul challapalli <
> [email protected]> wrote:
>
> > Hi,
> >
> > Info about Data : The data is auto partitioned tpch 0.01 data. The second
> > filter is a non-partitioned column, so in the first case the 'OR'
> predicate
> > results in a full-table scan, while in the second case, partition pruning
> > takes effect.
> >
> > The first case results in a hash agg and the second case in a streaming
> > agg. Any idea why?
> >
> > 1. explain plan for select distinct l_modline, l_moddate from
> > `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> > '1992-01-01' or l_shipdate=date'1992-01-01';
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(l_modline=[$0], l_moddate=[$1])
> > 00-02        Project(l_modline=[$0], l_moddate=[$1])
> > 00-03          HashAgg(group=[{0, 1}])
> > 00-04            Project(l_modline=[$2], l_moddate=[$0])
> > 00-05              SelectionVectorRemover
> > 00-06                Filter(condition=[OR(=($0, 1992-01-01), =($1,
> > 1992-01-01))])
> > 00-07                  Project(l_moddate=[$2], l_shipdate=[$1],
> > l_modline=[$0])
> > 00-08                    Scan..........
> >
> > 2. explain plan for select distinct l_modline, l_moddate from
> > `tpch_multiple_partitions/lineitem_twopart` where l_moddate=date
> > '1992-01-01' and l_shipdate=date'1992-01-01';
> > +------+------+
> > | text | json |
> > +------+------+
> > | 00-00    Screen
> > 00-01      Project(l_modline=[$0], l_moddate=[$1])
> > 00-02        Project(l_modline=[$0], l_moddate=[$1])
> > 00-03          StreamAgg(group=[{0, 1}])
> > 00-04            Sort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC])
> > 00-05              Project(l_modline=[$2], l_moddate=[$0])
> > 00-06                SelectionVectorRemover
> > 00-07                  Filter(condition=[AND(=($0, 1992-01-01), =($1,
> > 1992-01-01))])
> > 00-08                    Project(l_moddate=[$2], l_shipdate=[$1],
> > l_modline=[$0])
> > 00-09                      Scan.....................
> >
> > - Rahul
> >
>

Re: Hash Agg vs Streaming Agg for a smaller data set

Reply via email to