[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

Matthew Normyle (JIRA) Fri, 03 Aug 2018 16:04:54 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568893#comment-16568893
 ]


Matthew Normyle commented on SPARK-24928:
-----------------------------------------

In CartesianRDD.compute, changing:

    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

to:

    val it1 = rdd1.iterator(currSplit.s1, context)
    val it2 = rdd2.iterator(currSplit.s2, context)

    for (x <- it1; y <- it2) yield (x, y)

Seems to resolve this issue.

I am brand new to Scala and Spark. Does anyone have any insight as to why this 
seemingly superficial change could make such a large difference?
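
One likely explanation, sketched outside Spark: a Scala for-comprehension with two generators desugars to `flatMap` over the first and `map` over the second, so the expression after `y <-` is evaluated again for every `x`. In the original code that means `rdd2.iterator(...)` is called once per element of `rdd1`. In the modified code `it2` is created once, and in practice it is exhausted after the first `x`, so the loop does far less work but also appears to yield fewer pairs. The names below (`CartesianIteratorDemo`, `openRight`, `left`) are hypothetical stand-ins, not Spark APIs, and plain Scala iterators are assumed to behave like the ones `RDD.iterator` returns.

```scala
object CartesianIteratorDemo {
  // Counts how many times the right-hand iterator is (re)created.
  var opens = 0

  // Hypothetical stand-in for rdd2.iterator(currSplit.s2, context).
  def openRight(): Iterator[String] = { opens += 1; Iterator("r") }

  // Hypothetical stand-in for rdd1.iterator(currSplit.s1, context).
  def left(): Iterator[Int] = Iterator(1, 2, 3)

  // Original shape: equivalent to left().flatMap(x => openRight().map(y => (x, y))),
  // so openRight() runs once per x.
  def inlined(): List[(Int, String)] =
    (for (x <- left(); y <- openRight()) yield (x, y)).toList

  // Modified shape: it2 is created once and is empty after the first x.
  def hoisted(): List[(Int, String)] = {
    val it1 = left()
    val it2 = openRight()
    (for (x <- it1; y <- it2) yield (x, y)).toList
  }

  def main(args: Array[String]): Unit = {
    opens = 0
    val a = inlined()
    println(s"inlined: ${a.size} pairs, right side opened $opens times")

    opens = 0
    val b = hoisted()
    println(s"hoisted: ${b.size} pairs, right side opened $opens times")
  }
}
```

In this sketch the inlined form opens the right side three times and produces three pairs, while the hoisted form opens it once and produces a single pair. If the same holds inside `CartesianRDD.compute`, the speedup would come with an incomplete result, so it is worth comparing `count()` before and after the change.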

> spark sql cross join running time too long
> ------------------------------------------
>
>                 Key: SPARK-24928
>                 URL: https://issues.apache.org/jira/browse/SPARK-24928
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 1.6.2
>            Reporter: LIFULONG
>            Priority: Minor
>
> Spark SQL takes too long to run when the left and right input tables are
> small HDFS text-format files.
> The SQL is: select * from t1 cross join t2
> t1 has 499999 lines and three columns.
> t2 has 1 line and one column.
> The query ran for more than 30 minutes and then failed.
>  
>  
> Spark's CartesianRDD has the same problem. Example test code:
> // 1 line, 1 column
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")
> // 499999 lines, 3 columns
> val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")
> val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> This runs for more than 5 minutes, while CartesianRDD(sc, ones, twos) takes
> less than 10 seconds.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

