I think the “multiple distinct count” code path also gets triggered if there’s a mixture of distinct and non-distinct aggregates.
> On Jul 26, 2017, at 5:16 PM, Aman Sinha <[email protected]> wrote: > > It sounds like you have narrowed it down further and it could be an issue > with the scalar check when 2 or more cartesian joins are present. > Please file a JIRA with relevant details. > > On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <[email protected] > <mailto:[email protected]>> > wrote: > >> Thanks for pointing out the possible reasons @Aman @Julian . I am not sure >> that's the real problem . >> >> As I initially mentioned,sorry for not clearly described ,queries like >> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be >> right.but "select count(*),sum(a),count (distinct b),count(distinct c) from >> t where dt=xxx"will be wrong. The only one count(distinct) scenario does >> do the Cartesian join check too.It will and a broadcast operator on my >> transferred relnode. So I think I have not loose the scalar property. But >> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule >> check for satisfying the Cartesian conditions happened only once.It sounds >> should happen twice as there are two nested loop join operators at the two >> count(distinct) scenarios .If it turns out a bug ,I will log a Jira . >> >> Best Regards >> >> >> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <[email protected] >> <mailto:[email protected]>> wrote: >> >>> Yes, the RelMdMaxRowCount statistic would be useful for this. Thanks for >>> pointing that out. I'll see if we can leverage that. The grouping sets >>> is not yet supported in Drill, but that would be a better solution since >> it >>> avoids extra scans. >>> >>> -Aman >>> >>> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <[email protected] >>> <mailto:[email protected]>> wrote: >>> >>>> Aman, >>>> >>>> Thanks for moving dev@calcite to Bcc. This is properly a Drill >> question. >>>> >>>> A blanket restriction on cartesian joins is a blunt instrument. >> Sometimes >>>> cartesian joins are valid, safe, and the best plan for a query. This >> is a >>>> case in point. Users shouldn’t have to change config parameters to get >> it >>>> to work. >>>> >>>> (Actually I don’t know the query, but >>>> >>>> select count(distinct deptno), count(distinct gender) from emp >>>> >>>> is equivalent.) >>>> >>>> Drill should detect that a relational expression can return at most one >>>> row, and allow a cartesian join if one side is such. Calcite has a >>>> RelMdMaxRowCount statistic for this. This was added as part of >>>> http://issues.apache.org/jira/browse/CALCITE-604 < >>>> http://issues.apache.org/jira/browse/CALCITE-604 >>>> <http://issues.apache.org/jira/browse/CALCITE-604>>. This rule is 100% >>>> safe. No config parameters required. >>>> >>>> Also, Calcite has an alternative way of handling multiple distinct >>>> aggregates that rewrites to use grouping sets. It doesn’t generate >>>> self-joins, cartesian or otherwise. http://issues.apache.org/jira/ >>>> <http://issues.apache.org/jira/> >>>> browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732 >>>> <http://issues.apache.org/jira/browse/CALCITE-732>>. >>>> >>>> Julian >>>> >>>> >>>> >>>> >>>> >>>> >>>>> On Jul 26, 2017, at 9:20 AM, Aman Sinha <[email protected]> >> wrote: >>>>> >>>>> [Since this is Drill specific, I put dev@calcite on BCC]. >>>>> >>>>> If you have two aggregates: Count(distinct a), Count(distinct b), the >>>>> Calcite logical plan consists of a cartesian join of 2 subqueries >> each >>> of >>>>> which first does a group-by on the distinct column followed by a >> count >>>>> aggregate. By default, Drill only processes cartesian join if one >>>> input >>>>> of the join is known to be scalar (single row). It sounds like after >>> you >>>>> did the transformation to use the cache, that scalar property somehow >>> did >>>>> not get propagated. >>>>> You can override this behavior by a session configuration: (this >> will >>>> use >>>>> a nested loop join even if inputs are not provably scalar, but it >>> should >>>> be >>>>> used for specific query only). For a more general solution, I >>> believe >>>>> you may have to create an enhancement JIRA with appropriate details. >>>>> 'alter session set planner.enable_nljoin_for_scalar_only = false'; >>>>> >>>>> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong < >> [email protected]> >>>>> wrote: >>>>> >>>>>> HI all: >>>>>> >>>>>> I materialize the count distinct query result to a cache, then when >>>> user >>>>>> query the count distinct , a specific rule will translate the query >> to >>>> the >>>>>> cache. It turns out right when the query has only one count >> (distinct >>> ) >>>>>> operator ,but when it has two count (distinct ) ,it causes error >> .The >>>> error >>>>>> info is here: >>>>>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269 >>>>>> >>>>>> >>>>>> Best Regards.
