[ https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734395#comment-14734395 ]
Cheng Hao commented on SPARK-10484: ----------------------------------- In cartesian produce implementation, there is 2 level nested loops, and exchanging the order of the join tables, will reduce the outer loop times(with smaller table), but increase the looping times of the inner loop(bigger table), this is actually a manually optimization for the sql query. I created a PR for optimizing the cartesian join by involving the broadcast join. > [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables > do cross join > ----------------------------------------------------------------------------------------- > > Key: SPARK-10484 > URL: https://issues.apache.org/jira/browse/SPARK-10484 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.5.0 > Reporter: Yi Zhou > Priority: Critical > > Found that it lost task or GC OOM when below cross join happen. The left big > table is ~1.2G in size and the right small table is ~2.2K. > Key SQL > {code:sql} > SELECT > CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, > pr_review_date, > pr_review_content > FROM product_reviews pr, temp_stores_with_regression stores_with_regression > WHERE locate(lower(stores_with_regression.s_store_name), > lower(pr.pr_review_content), 1) >= 1 ; > {code} > Physical Plan > {code:sql} > TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) > AS store_ID#446,pr_review_date#449,pr_review_content#455] > Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1) > CartesianProduct > HiveTableScan [pr_review_date#449,pr_review_content#455], > (MetastoreRelation bigbench, product_reviews, Some(pr)) > HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation > bigbench, temp_stores_with_regression, Some(stores_with_regression)) > Code Generation: true > {code} > We also found a strange behavior that exchanging the two table in 'From' > clause can pass. > Key SQL > {code:sql} > SELECT > CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, > pr_review_date, > pr_review_content > FROM temp_stores_with_regression stores_with_regression, product_reviews pr > WHERE locate(lower(stores_with_regression.s_store_name), > lower(pr.pr_review_content), 1) >= 1 ; > {code} > Physical Plan > {code:sql} > TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) > AS store_ID#446,pr_review_date#451,pr_review_content#457] > Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1) > CartesianProduct > HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation > bigbench, temp_stores_with_regression, Some(stores_with_regression)) > HiveTableScan [pr_review_date#451,pr_review_content#457], > (MetastoreRelation bigbench, product_reviews, Some(pr)) > Code Generation: true > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org