Re: Drill query planning error

Julian Hyde Wed, 26 Jul 2017 17:18:11 -0700

I think the “multiple distinct count” code path also gets triggered if there’s 
a mixture of distinct and non-distinct aggregates.


> On Jul 26, 2017, at 5:16 PM, Aman Sinha <[email protected]> wrote:
> 
> It sounds like you have narrowed it down further and it could be an issue
> with the scalar check when 2 or more cartesian joins are present.
> Please file a JIRA with relevant details.
> 
> On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <[email protected] 
> <mailto:[email protected]>>
> wrote:
> 
>> Thanks for pointing out the possible reasons @Aman @Julian . I am not sure
>> that's the real problem .
>> 
>> As I initially mentioned,sorry for not clearly described ,queries like
>> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be
>> right.but "select count(*),sum(a),count (distinct b),count(distinct c) from
>> t where dt=xxx"will be wrong.  The only one count(distinct) scenario does
>> do the Cartesian join check too.It will and a broadcast operator on my
>> transferred relnode. So I think I have not loose the scalar property. But
>> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule
>> check for satisfying the Cartesian conditions happened only once.It sounds
>> should happen twice as there are two nested loop join operators at the two
>> count(distinct) scenarios .If it turns out a bug ,I will log a Jira .
>> 
>> Best Regards
>> 
>> 
>> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>>> Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks for
>>> pointing that out.   I'll see if we can leverage that.  The grouping sets
>>> is not yet supported in Drill, but that would be a better solution since
>> it
>>> avoids extra scans.
>>> 
>>> -Aman
>>> 
>>> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>>> Aman,
>>>> 
>>>> Thanks for moving dev@calcite to Bcc. This is properly a Drill
>> question.
>>>> 
>>>> A blanket restriction on cartesian joins is a blunt instrument.
>> Sometimes
>>>> cartesian joins are valid, safe, and the best plan for a query. This
>> is a
>>>> case in point. Users shouldn’t have to change config parameters to get
>> it
>>>> to work.
>>>> 
>>>> (Actually I don’t know the query, but
>>>> 
>>>>  select count(distinct deptno), count(distinct gender) from emp
>>>> 
>>>> is equivalent.)
>>>> 
>>>> Drill should detect that a relational expression can return at most one
>>>> row, and allow a cartesian join if one side is such. Calcite has a
>>>> RelMdMaxRowCount statistic for this. This was added as part of
>>>> http://issues.apache.org/jira/browse/CALCITE-604 <
>>>> http://issues.apache.org/jira/browse/CALCITE-604 
>>>> <http://issues.apache.org/jira/browse/CALCITE-604>>. This rule is 100%
>>>> safe. No config parameters required.
>>>> 
>>>> Also, Calcite has an alternative way of handling multiple distinct
>>>> aggregates that rewrites to use grouping sets. It doesn’t generate
>>>> self-joins, cartesian or otherwise.  http://issues.apache.org/jira/ 
>>>> <http://issues.apache.org/jira/>
>>>> browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732 
>>>> <http://issues.apache.org/jira/browse/CALCITE-732>>.
>>>> 
>>>> Julian
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Jul 26, 2017, at 9:20 AM, Aman Sinha <[email protected]>
>> wrote:
>>>>> 
>>>>> [Since this is Drill specific, I put dev@calcite on BCC].
>>>>> 
>>>>> If you have two aggregates: Count(distinct a), Count(distinct b), the
>>>>> Calcite logical plan consists of a cartesian join of 2 subqueries
>> each
>>> of
>>>>> which first does a group-by on the distinct column followed by a
>> count
>>>>> aggregate.   By default,  Drill only processes cartesian join if one
>>>> input
>>>>> of the join is known to be scalar (single row).  It sounds like after
>>> you
>>>>> did the transformation to use the cache, that scalar property somehow
>>> did
>>>>> not get propagated.
>>>>> You can override this behavior by a session configuration:  (this
>> will
>>>> use
>>>>> a nested loop join even if inputs are not provably scalar, but it
>>> should
>>>> be
>>>>> used for specific query only).    For a more general solution, I
>>> believe
>>>>> you may have to create an enhancement JIRA with appropriate details.
>>>>>  'alter session set planner.enable_nljoin_for_scalar_only = false';
>>>>> 
>>>>> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <
>> [email protected]>
>>>>> wrote:
>>>>> 
>>>>>> HI all:
>>>>>> 
>>>>>> I materialize the count distinct query result to a cache, then when
>>>> user
>>>>>> query the count distinct , a specific rule will translate the query
>> to
>>>> the
>>>>>> cache. It turns out right when the query has only one count
>> (distinct
>>> )
>>>>>> operator ,but when it has two count (distinct ) ,it causes error
>> .The
>>>> error
>>>>>> info is here:
>>>>>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
>>>>>> 
>>>>>> 
>>>>>> Best Regards.

Re: Drill query planning error

Reply via email to