[jira] [Commented] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

Cheng Hao (JIRA) Tue, 08 Sep 2015 00:59:08 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734395#comment-14734395
 ]


Cheng Hao commented on SPARK-10484:
-----------------------------------

In cartesian produce implementation, there is 2 level nested loops, and 
exchanging the order of the join tables, will reduce the outer loop times(with 
smaller table), but increase the looping times of the inner loop(bigger table), 
this is actually a manually optimization for the sql query.

I created a PR for optimizing the cartesian join by involving the broadcast 
join.

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-10484
>                 URL: https://issues.apache.org/jira/browse/SPARK-10484
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yi Zhou
>            Priority: Critical
>
> Found that it lost task or GC OOM when below cross join happen. The left big 
> table is ~1.2G in size and  the right small table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>    HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>    HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior that exchanging the two table in 'From' 
> clause can pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>    HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>    HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

Reply via email to