Provided `planner.enable_nljoin_for_scalar_only` is set to false, even without an explicit join condition, the query should use the Cartesian join/nested loop join.
-- Zelaine On 5/11/17, 4:20 AM, "Anup Tiwari" <[email protected]> wrote: Hi, I have one question here.. so if we have to use Cartesian join in Drill then do we have to follow some workaround like Shadi mention : adding a dummy column on the fly that has the value 1 in both tables and then join on that column leading to having a match of every row of the first table with every row of the second table, hence do a Cartesian product? OR If we just don't specify join condition like : select a.*, b.* from tt1 as a, tt2 b; then will it internally treat this query as Cartesian join. Regards, *Anup Tiwari* On Mon, May 8, 2017 at 10:00 PM, Zelaine Fong <[email protected]> wrote: > Cartesian joins in Drill are implemented as nested loop joins, and I think > you should see that reflected in the resultant query plan when you run > explain plan on the query. > > Yes, Cartesian joins/nested loop joins are expensive because you’re > effectively doing an MxN read of your tables. There are more efficient > ways of processing a nested loop join, e.g., by creating an index on the > larger table in the join and then using that index to do lookups into that > table. That way, the nested loop join cost is the cost of creating the > index + M, where M is the number of rows in the smaller table and assuming > the lookup cost into the index does minimize the amount of data read of the > second table. Drill currently doesn’t do this. > > -- Zelaine > > On 5/8/17, 9:09 AM, "Muhammad Gelbana" <[email protected]> wrote: > > I believe clhubert is referring to this discussion > <http://drill-user.incubator.apache.narkive.com/TIXWiTY4/ > cartesian-product-in-apache-drill#post1> > . > > So why Drill doesn't transform this query into a nested join query ? > Simply > because there is no Calcite rule to transform it into a nested loop > join ? > Is it not technically possible to write such Rule or is it feasible so > I > may take on this challenge ? > > Also pardon me for repeating my question but I fail to find an answer > in > your replies, why doesn't Drill just run a cartesian join ? Because > it's > expensive regarding resources (i.e. CPU\Network\RAM) ? > > Thanks a lot Shadi for the query, it works for me. > > *---------------------* > *Muhammad Gelbana* > http://www.linkedin.com/in/mgelbana > > On Mon, May 8, 2017 at 6:10 AM, Shadi Khalifa <[email protected]> > wrote: > > > Hi Muhammad, > > > > I did the following as a workaround to have Cartesian product. The > basic > > idea is to create a dummy column on the fly that has the value 1 in > both > > tables and then join on that column leading to having a match of > every row > > of the first table with every row of the second table, hence do a > Cartesian > > product. This might not be the most efficient way but it will do the > job. > > > > *Original Query:* > > SELECT * FROM > > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT > > 2147483647) `t0` > > INNER JOIN > > ( SELECT 'ABC' `UserID` FROM `dfs`.`path_to_parquet_file` tc LIMIT > > 2147483647) `t1` > > ON (`t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID`) > > LIMIT 2147483647 > > > > *Workaround (add columns **d1a381f3g73 and **d1a381f3g74 to tables > one > > and two, respectively. Names don't really matter, just need to be > unique):* > > SELECT * FROM > > ( SELECT *1 as d1a381f3g73*, 'ABC' `UserID` FROM > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t0` > > INNER JOIN > > ( SELECT *1 as d1a381f3g74*, 'ABC' `UserID` FROM > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` > > ON (`t0`.*d1a381f3g73 = *`t1`.*d1a381f3g74*) > > WHERE `t0`.`UserID` IS NOT DISTINCT FROM `t1`.`UserID` > > LIMIT 2147483647 > > > > Regards > > > > > > *Shadi Khalifa, PhD* > > Postdoctoral Fellow > > Cognitive Analytics Development Hub > > Centre for Advanced Computing > > Queen’s University > > (613) 533-6000 x78347 > > http://cac.queensu.ca > > > > I'm just a neuron in the society collective brain > > > > *Join us for HPCS in June 2017! Register at:* *http://2017.hpcs.ca/ > > <http://2017.hpcs.ca/>* > > > > P Please consider your environmental responsibility before printing > this > > e-mail > > > > *01001001 00100000 01101100 01101111 01110110 01100101 00100000 > 01000101 > > 01100111 01111001 01110000 01110100 * > > > > *The information transmitted is intended only for the person or > entity to > > which it is addressed and may contain confidential material. Any > review or > > dissemination of this information by persons other than the intended > > recipient is prohibited. If you received this in error, please > contact the > > sender and delete the material from any computer. Thank you.* > > > > > > > > On Saturday, May 6, 2017 6:05 PM, Muhammad Gelbana < > [email protected]> > > wrote: > > > > > > > > Here it is: > > > > SELECT * FROM (SELECT 'ABC' `UserID` FROM > `dfs`.`path_to_parquet_file` tc > > LIMIT 2147483647) `t0` INNER JOIN (SELECT 'ABC' `UserID` FROM > > `dfs`.`path_to_parquet_file` tc LIMIT 2147483647) `t1` ON ( > > > > `t0`.`UserID` IS NOT DISTINCT FROM > > > > `t1`.`UserID`) LIMIT 2147483647 > > > > I debugged Drill code and found it decomposes *IS NOT DISTINCT FROM* > into > > > > *`t0`.`UserID` = `t1`.`UserID` OR (`t0`.`UserID` IS NULL && > `t1`.`UserID` > > IS NULL**)* while checking if the query is a cartesian join, and > when the > > check returns true, it throws an excetion saying: *This query cannot > be > > planned possibly due to either a cartesian join or an inequality > join* > > > > > > *---------------------* > > *Muhammad Gelbana* > > http://www.linkedin.com/in/mgelbana > > > > On Sat, May 6, 2017 at 6:53 PM, Gautam Parai <[email protected]> > wrote: > > > > > Can you please specify the query you are trying to execute? > > > > > > > > > Gautam > > > > > > ________________________________ > > > From: Muhammad Gelbana <[email protected]> > > > Sent: Saturday, May 6, 2017 7:34:53 AM > > > To: [email protected]; [email protected] > > > Subject: Running cartesian joins on Drill > > > > > > Is there a reason why Drill would intentionally reject cartesian > join > > > queries even if *planner.enable_nljoin_for_scalar_only* is > disabled ? > > > > > > Any ideas how could a query be rewritten to overcome this > restriction ? > > > > > > *---------------------* > > > *Muhammad Gelbana* > > > http://www.linkedin.com/in/mgelbana > > > > > > > > > > > >
