Re: Running cartesian joins on Drill

Zelaine Fong Thu, 11 May 2017 08:06:11 -0700

Provided `planner.enable_nljoin_for_scalar_only` is set to false, even without 
an explicit join condition, the query should use the Cartesian join/nested loop 
join.


-- Zelaine

On 5/11/17, 4:20 AM, "Anup Tiwari" <[email protected]> wrote:

    Hi,
    
    I have one question here.. so if we have to use Cartesian join in Drill
    then do we have to follow some workaround like Shadi mention : adding a
    dummy column on the fly that has the value 1 in both tables and then join
    on that column leading to having a match of every row of the first table
    with every row of the second table, hence do a Cartesian product?
    OR
    If we just don't specify join condition like :
    select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this
    query as Cartesian join.
    
    Regards,
    *Anup Tiwari*
    
    On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <[email protected]> wrote:
    
    > Cartesian joins in Drill are implemented as nested loop joins, and I think
    > you should see that reflected in the resultant query plan when you run
    > explain plan on the query.
    >
    > Yes, Cartesian joins/nested loop joins are expensive because you’re
    > effectively doing an MxN read of your tables.  There are more efficient
    > ways of processing a nested loop join, e.g., by creating an index on the
    > larger table in the join and then using that index to do lookups into that
    > table.  That way, the nested loop join cost is the cost of creating the
    > index + M, where M is the number of rows in the smaller table and assuming
    > the lookup cost into the index does minimize the amount of data read of 
the
    > second table.  Drill currently doesn’t do this.
    >
    > -- Zelaine
    >
    > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <[email protected]> wrote:
    >
    >     I believe clhubert is referring to this discussion
    >     <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/
    > cartesian-product-in-apache-drill#post1>
    >     .
    >
    >     So why Drill doesn't transform this query into a nested join query ?
    > Simply
    >     because there is no Calcite rule to transform it into a nested loop
    > join ?
    >     Is it not technically possible to write such Rule or is it feasible so
    > I
    >     may take on this challenge ?
    >
    >     Also pardon me for repeating my question but I fail to find an answer
    > in
    >     your replies, why doesn't Drill just run a cartesian join ? Because
    > it's
    >     expensive regarding resources (i.e. CPU\Network\RAM) ?
    >
    >     Thanks a lot Shadi for the query, it works for me.
    >
    >     *---------------------*
    >     *Muhammad Gelbana*
    >     http://www.linkedin.com/in/mgelbana
    >
    >     On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <[email protected]>
    > wrote:
    >
    >     > Hi Muhammad,
    >     >
    >     > I did the following as a workaround to have Cartesian product. The
    > basic
    >     > idea is to create a dummy column on the fly that has the value 1 in
    > both
    >     > tables and then join on that column leading to having a match of
    > every row
    >     > of the first table with every row of the second table, hence do a
    > Cartesian
    >     > product. This might not be the most efficient way but it will do the
    > job.
    >     >
    >     > *Original Query:*
    >     > SELECT * FROM
    >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    >     > 2147483647) `t0`
    >     > INNER JOIN
    >     > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT
    >     > 2147483647) `t1`
    >     > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`)
    >     > LIMIT 2147483647
    >     >
    >     > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables
    > one
    >     > and two, respectively. Names don't really matter, just need to be
    > unique):*
    >     > SELECT * FROM
    >     > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0`
    >     > INNER JOIN
    >     > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1`
    >     > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*)
    >     > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`
    >     > LIMIT 2147483647
    >     >
    >     > Regards
    >     >
    >     >
    >     > *Shadi Khalifa, PhD*
    >     > Postdoctoral Fellow
    >     > Cognitive Analytics Development Hub
    >     > Centre for Advanced Computing
    >     > Queen’s University
    >     > (613) 533-6000 x78347
    >     > http://cac.queensu.ca
    >     >
    >     > I'm just a neuron in the society collective brain
    >     >
    >     > *Join us for HPCS in June 2017! Register at:*  *http://2017.hpcs.ca/
    >     > <http://2017.hpcs.ca/>*
    >     >
    >     > P Please consider your environmental responsibility before printing
    > this
    >     > e-mail
    >     >
    >     > *01001001 00100000 01101100 01101111 01110110 01100101 00100000
    > 01000101
    >     > 01100111 01111001 01110000 01110100 *
    >     >
    >     > *The information transmitted is intended only for the person or
    > entity to
    >     > which it is addressed and may contain confidential material. Any
    > review or
    >     > dissemination of this information by persons other than the intended
    >     > recipient is prohibited. If you received this in error, please
    > contact the
    >     > sender and delete the material from any computer. Thank you.*
    >     >
    >     >
    >     >
    >     > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana <
    > [email protected]>
    >     > wrote:
    >     >
    >     >
    >     > 
    >     > Here it is:
    >     >
    >     > SELECT * FROM (SELECT 'ABC' `UserID` FROM
    > `dfs`.`path_to_parquet_file` tc
    >     > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM
    >     > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON (
    >     > 
    >     > `t0`.`UserID` IS NOT DISTINCT FROM
    >     > 
    >     > `t1`.`UserID`) LIMIT 2147483647
    >     >
    >     > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM*
    > into
    >     > 
    >     > *`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL &&
    > `t1`.`UserID`
    >     > IS NULL**)* while checking if the query is a cartesian join, and
    > when the
    >     > check returns true, it throws an excetion saying: *This query cannot
    > be
    >     > planned possibly due to either a cartesian join or an inequality
    > join*
    >     >
    >     >
    >     > *---------------------*
    >     > *Muhammad Gelbana*
    >     > http://www.linkedin.com/in/mgelbana
    >     >
    >     > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <[email protected]>
    > wrote:
    >     >
    >     > > Can you please specify the query you are trying to execute?
    >     > >
    >     > >
    >     > > Gautam
    >     > >
    >     > > ________________________________
    >     > > From: Muhammad Gelbana <[email protected]>
    >     > > Sent: Saturday, May 6, 2017 7:34:53 AM
    >     > > To: [email protected]; [email protected]
    >     > > Subject: Running cartesian joins on Drill
    >     > >
    >     > > Is there a reason why Drill would intentionally reject cartesian
    > join
    >     > > queries even if *planner.enable_nljoin_for_scalar_only* is
    > disabled ?
    >     > >
    >     > > Any ideas how could a query be rewritten to overcome this
    > restriction ?
    >     > >
    >     > > *---------------------*
    >     > > *Muhammad Gelbana*
    >     > > http://www.linkedin.com/in/mgelbana
    >     > >
    >     >
    >     >
    >     >
    >
    >
    >

Re: Running cartesian joins on Drill

Reply via email to