It sounds like you have narrowed it down further and it could be an issue with the scalar check when 2 or more cartesian joins are present. Please file a JIRA with relevant details.
On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <[email protected]> wrote: > Thanks for pointing out the possible reasons @Aman @Julian . I am not sure > that's the real problem . > > As I initially mentioned,sorry for not clearly described ,queries like > "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be > right.but "select count(*),sum(a),count (distinct b),count(distinct c) from > t where dt=xxx"will be wrong. The only one count(distinct) scenario does > do the Cartesian join check too.It will and a broadcast operator on my > transferred relnode. So I think I have not loose the scalar property. But > when I debug the two count(distinct ) , I find the NestedLoopJoinPrule > check for satisfying the Cartesian conditions happened only once.It sounds > should happen twice as there are two nested loop join operators at the two > count(distinct) scenarios .If it turns out a bug ,I will log a Jira . > > Best Regards > > > On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <[email protected]> wrote: > > > Yes, the RelMdMaxRowCount statistic would be useful for this. Thanks for > > pointing that out. I'll see if we can leverage that. The grouping sets > > is not yet supported in Drill, but that would be a better solution since > it > > avoids extra scans. > > > > -Aman > > > > On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <[email protected]> wrote: > > > > > Aman, > > > > > > Thanks for moving dev@calcite to Bcc. This is properly a Drill > question. > > > > > > A blanket restriction on cartesian joins is a blunt instrument. > Sometimes > > > cartesian joins are valid, safe, and the best plan for a query. This > is a > > > case in point. Users shouldn’t have to change config parameters to get > it > > > to work. > > > > > > (Actually I don’t know the query, but > > > > > > select count(distinct deptno), count(distinct gender) from emp > > > > > > is equivalent.) > > > > > > Drill should detect that a relational expression can return at most one > > > row, and allow a cartesian join if one side is such. Calcite has a > > > RelMdMaxRowCount statistic for this. This was added as part of > > > http://issues.apache.org/jira/browse/CALCITE-604 < > > > http://issues.apache.org/jira/browse/CALCITE-604>. This rule is 100% > > > safe. No config parameters required. > > > > > > Also, Calcite has an alternative way of handling multiple distinct > > > aggregates that rewrites to use grouping sets. It doesn’t generate > > > self-joins, cartesian or otherwise. http://issues.apache.org/jira/ > > > browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732>. > > > > > > Julian > > > > > > > > > > > > > > > > > > > > > > On Jul 26, 2017, at 9:20 AM, Aman Sinha <[email protected]> > wrote: > > > > > > > > [Since this is Drill specific, I put dev@calcite on BCC]. > > > > > > > > If you have two aggregates: Count(distinct a), Count(distinct b), the > > > > Calcite logical plan consists of a cartesian join of 2 subqueries > each > > of > > > > which first does a group-by on the distinct column followed by a > count > > > > aggregate. By default, Drill only processes cartesian join if one > > > input > > > > of the join is known to be scalar (single row). It sounds like after > > you > > > > did the transformation to use the cache, that scalar property somehow > > did > > > > not get propagated. > > > > You can override this behavior by a session configuration: (this > will > > > use > > > > a nested loop join even if inputs are not provably scalar, but it > > should > > > be > > > > used for specific query only). For a more general solution, I > > believe > > > > you may have to create an enhancement JIRA with appropriate details. > > > > 'alter session set planner.enable_nljoin_for_scalar_only = false'; > > > > > > > > On Wed, Jul 26, 2017 at 4:14 AM, weijie tong < > [email protected]> > > > > wrote: > > > > > > > >> HI all: > > > >> > > > >> I materialize the count distinct query result to a cache, then when > > > user > > > >> query the count distinct , a specific rule will translate the query > to > > > the > > > >> cache. It turns out right when the query has only one count > (distinct > > ) > > > >> operator ,but when it has two count (distinct ) ,it causes error > .The > > > error > > > >> info is here: > > > >> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269 > > > >> > > > >> > > > >> Best Regards. > > > >> > > > > > > > > >
