Re: Running cartesian joins on Drill

Zelaine Fong Sun, 07 May 2017 19:32:52 -0700

@chulbert is correct.  You have to disable 
planner.enable_nljoin_for_scalar_only in order for Drill to use nested loop 
joins.  Otherwise, it only does so when one of the join inputs is a scalar 
subquery, i.e., a single return row.


Drill requires disabling the parameter because nested loop joins are currently 
implemented as inefficient Cartesian joins, which can be very expensive.  The 
end user needs to make an explicit choice to use them.

-- Zelaine

On 5/6/17, 9:17 PM, "clhub...@gmail.com" <clhub...@gmail.com> wrote:

    As you've observed, we have to manually enable nested loops (which
    cartesian joins require).
    
    alter session set `planner.enable_nljoin_for_scalar_only` = false;
    
    I am guessing this approach was taken because
    
    1.  Although ANSI compliant SQL.  These type of joins are not typical in
    most applications.
    2.  To avoid the possibility of the cost based optimizer selecting a nested
    loop algorithm (expensive join algorithm) by mistake.  If this happens a
    lot, it would decrease overall performance.
    3.  Simple enough workaround (manually enabling/disabling) until the logic
    is developed to automatically enable nested loops when needed.
    
    If anyone knows for "sure", please chime in.
    
    OLAP Engines like Modrian have already been reported to generate queries
    that need Cartesian joins.
    https://issues.apache.org/jira/browse/DRILL-2915
    
    Hopefully this helps...
    
    For my use case, it's no big deal to enable it manually.    If performance
    ever becomes an issue, I can toggle it off and on as needed.
    
    I asked a similar question..   If you check back in the archives for this
    list
    (12/2016)  you will see the details.
    
    
    
    
    On Sat, May 6, 2017 at 3:05 PM, Muhammad Gelbana <m.gelb...@gmail.com>
    wrote:
    
    > 
    > Here it is:
    >
    > SELECT * FROM (SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc
    > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    > 
    > `t0`.`UserID` IS NOT DISTINCT FROM
    > 
    > `t1`.`UserID`) LIMIT 2147483647
    >
    > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* into
    > 
    > *`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && `t1`.`UserID`
    > IS NULL**)* while checking if the query is a cartesian join, and when the
    > check returns true, it throws an excetion saying: *This query cannot be
    > planned possibly due to either a cartesian join or an inequality join*
    >
    >
    > *---------------------*
    > *Muhammad Gelbana*
    > http://www.linkedin.com/in/mgelbana
    >
    > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <gpa...@mapr.com> wrote:
    >
    > > Can you please specify the query you are trying to execute?
    > >
    > >
    > > Gautam
    > >
    > > ________________________________
    > > From: Muhammad Gelbana <m.gelb...@gmail.com>
    > > Sent: Saturday, May 6, 2017 7:34:53 AM
    > > To: u...@drill.apache.org; dev@drill.apache.org
    > > Subject: Running cartesian joins on Drill
    > >
    > > Is there a reason why Drill would intentionally reject cartesian join
    > > queries even if *planner.enable_nljoin_for_scalar_only* is disabled ?
    > >
    > > Any ideas how could a query be rewritten to overcome this restriction ?
    > >
    > > *---------------------*
    > > *Muhammad Gelbana*
    > > http://www.linkedin.com/in/mgelbana
    > >
    >

Re: Running cartesian joins on Drill

Reply via email to