It sounds like you have narrowed it down further and it could be an issue
with the scalar check when 2 or more cartesian joins are present.
Please file a JIRA with relevant details.

On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <[email protected]>
wrote:

> Thanks for pointing out the possible reasons @Aman @Julian . I am not sure
> that's the real problem .
>
> As I initially mentioned,sorry for not clearly described ,queries like
> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be
> right.but "select count(*),sum(a),count (distinct b),count(distinct c) from
> t where dt=xxx"will be wrong.  The only one count(distinct) scenario does
> do the Cartesian join check too.It will and a broadcast operator on my
> transferred relnode. So I think I have not loose the scalar property. But
> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule
>  check for satisfying the Cartesian conditions happened only once.It sounds
> should happen twice as there are two nested loop join operators at the two
> count(distinct) scenarios .If it turns out a bug ,I will log a Jira .
>
> Best Regards
>
>
> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <[email protected]> wrote:
>
> > Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks for
> > pointing that out.   I'll see if we can leverage that.  The grouping sets
> > is not yet supported in Drill, but that would be a better solution since
> it
> > avoids extra scans.
> >
> > -Aman
> >
> > On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <[email protected]> wrote:
> >
> > > Aman,
> > >
> > > Thanks for moving dev@calcite to Bcc. This is properly a Drill
> question.
> > >
> > > A blanket restriction on cartesian joins is a blunt instrument.
> Sometimes
> > > cartesian joins are valid, safe, and the best plan for a query. This
> is a
> > > case in point. Users shouldn’t have to change config parameters to get
> it
> > > to work.
> > >
> > > (Actually I don’t know the query, but
> > >
> > >   select count(distinct deptno), count(distinct gender) from emp
> > >
> > > is equivalent.)
> > >
> > > Drill should detect that a relational expression can return at most one
> > > row, and allow a cartesian join if one side is such. Calcite has a
> > > RelMdMaxRowCount statistic for this. This was added as part of
> > > http://issues.apache.org/jira/browse/CALCITE-604 <
> > > http://issues.apache.org/jira/browse/CALCITE-604>. This rule is 100%
> > > safe. No config parameters required.
> > >
> > > Also, Calcite has an alternative way of handling multiple distinct
> > > aggregates that rewrites to use grouping sets. It doesn’t generate
> > > self-joins, cartesian or otherwise.  http://issues.apache.org/jira/
> > > browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732>.
> > >
> > > Julian
> > >
> > >
> > >
> > >
> > >
> > >
> > > > On Jul 26, 2017, at 9:20 AM, Aman Sinha <[email protected]>
> wrote:
> > > >
> > > > [Since this is Drill specific, I put dev@calcite on BCC].
> > > >
> > > > If you have two aggregates: Count(distinct a), Count(distinct b), the
> > > > Calcite logical plan consists of a cartesian join of 2 subqueries
> each
> > of
> > > > which first does a group-by on the distinct column followed by a
> count
> > > > aggregate.   By default,  Drill only processes cartesian join if one
> > > input
> > > > of the join is known to be scalar (single row).  It sounds like after
> > you
> > > > did the transformation to use the cache, that scalar property somehow
> > did
> > > > not get propagated.
> > > > You can override this behavior by a session configuration:  (this
> will
> > > use
> > > > a nested loop join even if inputs are not provably scalar, but it
> > should
> > > be
> > > > used for specific query only).    For a more general solution, I
> > believe
> > > > you may have to create an enhancement JIRA with appropriate details.
> > > >   'alter session set planner.enable_nljoin_for_scalar_only = false';
> > > >
> > > > On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <
> [email protected]>
> > > > wrote:
> > > >
> > > >> HI all:
> > > >>
> > > >>  I materialize the count distinct query result to a cache, then when
> > > user
> > > >> query the count distinct , a specific rule will translate the query
> to
> > > the
> > > >> cache. It turns out right when the query has only one count
> (distinct
> > )
> > > >> operator ,but when it has two count (distinct ) ,it causes error
> .The
> > > error
> > > >> info is here:
> > > >> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
> > > >>
> > > >>
> > > >> Best Regards.
> > > >>
> > >
> > >
> >
>

Reply via email to