Aman,

Thanks for moving dev@calcite to Bcc. This is properly a Drill question.

A blanket restriction on cartesian joins is a blunt instrument. Sometimes a cartesian join is valid, safe, and the best plan for a query. This is a case in point. Users shouldn’t have to change config parameters to get it to work. (Actually I don’t know the query, but select count(distinct deptno), count(distinct gender) from emp is equivalent.)

Drill should detect that a relational expression can return at most one row, and allow a cartesian join if either input is such an expression. Calcite has a RelMdMaxRowCount statistic for this. This was added as part of http://issues.apache.org/jira/browse/CALCITE-604. This rule is 100% safe. No config parameters required.

Also, Calcite has an alternative way of handling multiple distinct aggregates that rewrites them to use grouping sets. It doesn’t generate self-joins, cartesian or otherwise. See http://issues.apache.org/jira/browse/CALCITE-732.

Julian

> On Jul 26, 2017, at 9:20 AM, Aman Sinha <[email protected]> wrote:
>
> [Since this is Drill specific, I put dev@calcite on BCC].
>
> If you have two aggregates: Count(distinct a), Count(distinct b), the
> Calcite logical plan consists of a cartesian join of 2 subqueries, each of
> which first does a group-by on the distinct column followed by a count
> aggregate. By default, Drill only processes a cartesian join if one input
> of the join is known to be scalar (single row). It sounds like after you
> did the transformation to use the cache, that scalar property somehow did
> not get propagated.
> You can override this behavior by a session configuration: (this will use
> a nested loop join even if inputs are not provably scalar, but it should be
> used for a specific query only). For a more general solution, I believe
> you may have to create an enhancement JIRA with appropriate details.
> 'alter session set planner.enable_nljoin_for_scalar_only = false';
>
> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <[email protected]>
> wrote:
>
>> Hi all:
>>
>> I materialize the count distinct query result to a cache; then, when a user
>> queries the count distinct, a specific rule will translate the query to the
>> cache. It works when the query has only one count(distinct)
>> operator, but when it has two count(distinct), it causes an error. The error
>> info is here:
>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
>>
>>
>> Best Regards.
>>

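For anyone who wants to check the one-row argument outside Drill, here is a minimal sketch using Python's built-in sqlite3 module. It is not Drill or Calcite code, and the emp table, its columns, and its rows are made up for illustration. It compares the direct query with a hand-written join form in which each input is an aggregate that returns exactly one row, so the cartesian join can only ever produce one row.

```python
import sqlite3

# In-memory database with a small, made-up emp table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (empno INTEGER, deptno INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(1, 10, "F"), (2, 10, "M"), (3, 20, "F"), (4, 30, "M"), (5, 30, "M")],
)

# Direct form: two distinct aggregates in one query.
direct = conn.execute(
    "SELECT COUNT(DISTINCT deptno), COUNT(DISTINCT gender) FROM emp"
).fetchall()

# Join form: each side groups on one column and then counts the groups.
# An aggregate with no GROUP BY at its top level returns exactly one row,
# so the cross join of the two sides returns exactly one row as well.
joined = conn.execute(
    """
    SELECT d.c, g.c
    FROM (SELECT COUNT(deptno) AS c
          FROM (SELECT deptno FROM emp GROUP BY deptno)) AS d
    CROSS JOIN
         (SELECT COUNT(gender) AS c
          FROM (SELECT gender FROM emp GROUP BY gender)) AS g
    """
).fetchall()

print(direct, joined)
```

Both queries return the same single row here, which is the property a planner could rely on when it can prove each join input has a maximum row count of one.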