Aman,

Thanks for moving dev@calcite to Bcc. This is properly a Drill question.

A blanket restriction on cartesian joins is a blunt instrument. Sometimes 
cartesian joins are valid, safe, and the best plan for a query. This is a case 
in point. Users shouldn’t have to change config parameters to get it to work.

(Actually I don’t know the query, but

  select count(distinct deptno), count(distinct gender) from emp 

is equivalent.)

Drill should detect that a relational expression can return at most one row, 
and allow a cartesian join if either side is such an expression. Calcite has a 
RelMdMaxRowCount statistic for this, added as part of 
http://issues.apache.org/jira/browse/CALCITE-604. A rule based on that 
statistic is 100% safe. No config parameters required.
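
To make the idea concrete, here is a minimal sketch of such a check. It is a toy 
model, not Calcite's or Drill's API: the Rel, Scan and Aggregate classes and the 
allowCartesianJoin method are hypothetical names made up for illustration. In a 
real planner the bound would come from the max-row-count metadata rather than 
from hand-written classes.

```java
// Toy model only: these classes are hypothetical and are not Calcite's or Drill's API.
class MaxRowCountSketch {
    /** A relational expression that can report an upper bound on its row count. */
    interface Rel {
        double maxRowCount();
    }

    /** A table scan: no useful upper bound is known. */
    static final class Scan implements Rel {
        @Override public double maxRowCount() {
            return Double.POSITIVE_INFINITY;
        }
    }

    /** An aggregate over an input, with some number of GROUP BY keys. */
    static final class Aggregate implements Rel {
        final Rel input;
        final int groupKeyCount;

        Aggregate(Rel input, int groupKeyCount) {
            this.input = input;
            this.groupKeyCount = groupKeyCount;
        }

        @Override public double maxRowCount() {
            // With no GROUP BY keys the aggregate returns exactly one row;
            // otherwise it returns at most as many rows as its input.
            return groupKeyCount == 0 ? 1d : input.maxRowCount();
        }
    }

    /** A cartesian join is allowed if either input is provably at most one row. */
    static boolean allowCartesianJoin(Rel left, Rel right) {
        return left.maxRowCount() <= 1d || right.maxRowCount() <= 1d;
    }

    public static void main(String[] args) {
        Rel emp = new Scan();
        // count(distinct x): group by x, then count with no group keys.
        Rel countDistinctDeptno = new Aggregate(new Aggregate(emp, 1), 0);
        Rel countDistinctGender = new Aggregate(new Aggregate(emp, 1), 0);
        System.out.println(allowCartesianJoin(countDistinctDeptno, countDistinctGender));
        System.out.println(allowCartesianJoin(emp, emp));
    }
}
```

The first join is accepted because both inputs are bounded at one row. The 
second is rejected because neither scan has a bound.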

Also, Calcite has an alternative way of handling multiple distinct aggregates 
that rewrites to use grouping sets. It doesn’t generate self-joins, cartesian 
or otherwise. See http://issues.apache.org/jira/browse/CALCITE-732.
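
As a rough illustration of why grouping sets avoid the join, the sketch below 
emulates GROUP BY GROUPING SETS ((deptno), (gender)) in plain Java and then 
counts the rows of each grouping set. It is my own simplified rendering of the 
idea, not the plan Calcite actually produces; the Emp class and the countDistinct 
method are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: emulates the grouping-sets idea, not Calcite's actual rewrite.
class GroupingSetsSketch {
    /** Hypothetical row type for the emp table. */
    static final class Emp {
        final int deptno;
        final String gender;

        Emp(int deptno, String gender) {
            this.deptno = deptno;
            this.gender = gender;
        }
    }

    /** Returns {count(distinct deptno), count(distinct gender)} without any join. */
    static long[] countDistinct(List<Emp> emps) {
        // Step 1: one aggregate over emp that produces one row per distinct deptno
        // (grouping set 0) and one row per distinct gender (grouping set 1).
        Set<List<Object>> groups = new HashSet<>();
        for (Emp e : emps) {
            groups.add(Arrays.<Object>asList(0, e.deptno));
            if (e.gender != null) { // count(distinct ...) ignores NULLs
                groups.add(Arrays.<Object>asList(1, e.gender));
            }
        }
        // Step 2: count the rows that belong to each grouping set.
        long[] counts = new long[2];
        for (List<Object> group : groups) {
            counts[(Integer) group.get(0)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Emp> emps = Arrays.asList(
            new Emp(10, "F"), new Emp(20, "F"), new Emp(30, "F"), new Emp(10, "M"));
        System.out.println(Arrays.toString(countDistinct(emps)));
    }
}
```

Both counts come out of a single pass over emp, so there is no cartesian join 
for the planner to accept or reject.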

Julian

> On Jul 26, 2017, at 9:20 AM, Aman Sinha <[email protected]> wrote:
> 
> [Since this is Drill specific, I put dev@calcite on BCC].
> 
> If you have two aggregates, Count(distinct a) and Count(distinct b), the
> Calcite logical plan consists of a cartesian join of 2 subqueries, each of
> which first does a group-by on the distinct column followed by a count
> aggregate. By default, Drill only processes a cartesian join if one input
> of the join is known to be scalar (single row). It sounds like after you
> did the transformation to use the cache, that scalar property somehow did
> not get propagated.
> You can override this behavior with a session configuration (this will use
> a nested loop join even if inputs are not provably scalar, so it should be
> used for specific queries only):
>   alter session set `planner.enable_nljoin_for_scalar_only` = false;
> For a more general solution, I believe you may have to create an
> enhancement JIRA with appropriate details.
> 
> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <[email protected]>
> wrote:
> 
>> Hi all,
>> 
>> I materialize the count distinct query result to a cache. Then, when a user
>> runs a count distinct query, a specific rule translates the query to use
>> the cache. This works when the query has only one count(distinct)
>> operator, but when it has two count(distinct) operators it causes an error.
>> The error info is here:
>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
>> 
>> 
>> Best Regards.
>> 
