[ https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568893#comment-16568893 ]
Matthew Normyle commented on SPARK-24928:
-----------------------------------------

In CartesianRDD.compute, changing:

    for (x <- rdd1.iterator(currSplit.s1, context);
         y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)

to:

    val it1 = rdd1.iterator(currSplit.s1, context)
    val it2 = rdd2.iterator(currSplit.s2, context)
    for (x <- it1; y <- it2) yield (x, y)

seems to resolve this issue.

I am brand new to Scala and Spark. Does anyone have any insight as to why this seemingly superficial change could make such a large difference?

> spark sql cross join running time too long
> ------------------------------------------
>
>                 Key: SPARK-24928
>                 URL: https://issues.apache.org/jira/browse/SPARK-24928
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 1.6.2
>            Reporter: LIFULONG
>            Priority: Minor
>
> Spark SQL takes too long to run when the left and right input tables are
> small HDFS text-format data.
> The SQL is: select * from t1 cross join t2
> t1 has 499999 lines and three columns.
> t2 has 1 line and only one column.
> The query ran for more than 30 minutes and then failed.
>
>
> Spark's CartesianRDD has the same problem. Example test code:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b") // 1 line, 1 column
> val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b") // 499999 lines, 3 columns
> val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> This runs for more than 5 minutes, while CartesianRDD(sc, ones, twos) takes
> less than 10 seconds.
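
On the question of why hoisting the two iterator calls in CartesianRDD.compute makes such a difference, here is a minimal sketch in plain Scala, with no Spark involved. The names `left`, `right`, `original` and `hoisted` are hypothetical stand-ins, not Spark code. A for-comprehension over iterators desugars to `flatMap`/`map`, so in the original form the inner expression is evaluated again for every outer element. In the hoisted form the inner iterator is created once and is already exhausted after the first outer element, so the two forms are probably not equivalent: the second is faster because it appears to yield fewer pairs.

```scala
object CartesianIteratorSketch {
  // Hypothetical stand-ins for rdd1.iterator(...) and rdd2.iterator(...).
  // Each call returns a fresh iterator, much as re-reading a partition would.
  def left: Iterator[Int] = Iterator(1, 2, 3)
  def right: Iterator[String] = Iterator("a", "b")

  // Original shape: desugars to left.flatMap(x => right.map(y => (x, y))),
  // so `right` is evaluated once per element of `left`.
  def original: List[(Int, String)] =
    (for (x <- left; y <- right) yield (x, y)).toList

  // Hoisted shape: it2 is a single iterator shared by every outer element.
  // The first x drains it; every later x sees an empty iterator.
  def hoisted: List[(Int, String)] = {
    val it1 = left
    val it2 = right
    (for (x <- it1; y <- it2) yield (x, y)).toList
  }

  def main(args: Array[String]): Unit = {
    println(original.size) // 6: the full cartesian product
    println(hoisted.size)  // 2: only the pairs for the first outer element
  }
}
```

If this carries over to CartesianRDD, it would also fit the timings in the report: with CartesianRDD(sc, twos, ones) the one-line RDD is the inner one, so its iterator is requested once per line of the 499999-line RDD, while with CartesianRDD(sc, ones, twos) the inner iterator is requested only once. Note that the Scala documentation discourages reusing an iterator after calling a method such as `map` on it, so the hoisted form relies on behavior that is not guaranteed.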