[ https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236919#comment-14236919 ]
Stephen Boesch commented on SPARK-4775:
---------------------------------------

Two small tweaks to the testing class have been made. I have now created a GitHub branch for this on the Huawei repo:

https://github.com/Huawei-Spark/spark/blob/SPARKSQL-4775/sql/core/src/test/scala/org/apache/spark/sql/SparkSQLJoinSuite.scala

This test may be run as follows:

(setup):    mvn -Pyarn -Phadoop-2.3 install compile package -DskipTests
(run test): mvn -pl sql/core -Pyarn -Phadoop-2.3 -DwildcardSuites=org.apache.spark.sql.SparkSQLJoinSuite test

Results/output:

Run starting. Expected test count is: 1
SparkSQLJoinSuite:
2014-12-06 10:42:36.174 java[22327:958089] Unable to load realm info from SCDynamicStore
10:42:41.370 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Table1 Contents:
[1,valA1]
[2,valA2]
Table2 Contents:
[1,valB1]
[1,valB2]
[2,valB3]
[2,valB4]
select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol, t2.strcol t2strcol from SparkJoinTable1 t1 JOIN SparkJoinTable2 t2 on t1.intcol = t2.intcol
select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol, t2.strcol t2strcol from SparkJoinTable1 t1 JOIN SparkJoinTable2 t2 on t1.intcol = t2.intcol
came back with 4 results
Results
[1,1,valA1,valB2]
[1,1,valA1,valB2]
[2,2,valA2,valB4]
[2,2,valA2,valB4]
ERROR: Row0 failed: Mismatch- act=valB2 exp=valB1
ERROR: Row2 failed: Mismatch- act=valB4 exp=valB3
- Basic Join on vanilla SparkSql: Simple Two Way 2 cols *** FAILED ***
  One or more rows did not match expected (SparkSQLJoinSuite.scala:81)
Run completed in 23 seconds, 258 milliseconds.
Total number of tests run: 1
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***

> Possible problem in a simple join? Getting duplicate rows and missing rows
> ---------------------------------------------------------------------------
>
>                  Key: SPARK-4775
>                  URL: https://issues.apache.org/jira/browse/SPARK-4775
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 1.3.0
>          Environment: Run on Mac but should be agnostic
>             Reporter: Stephen Boesch
>
> I am working on testing of HBase joins. As part of this work, some simple
> vanilla SparkSQL tests were created. Some of the results are surprising;
> here are the details:
> ------------------------------------
> Consider the following schema that includes two columns:
>
>     case class JoinTable2Cols(intcol: Int, strcol: String)
>
> Let us register two temp tables using this schema and insert 2 rows and 4
> rows respectively:
>
>     val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, s"valA$ix") })
>     rdd1.registerTempTable("SparkJoinTable1")
>     val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
>     val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, s"valB$is") })
>     val table2 = rdd2.registerTempTable("SparkJoinTable2")
>
> Here is the data in both tables:
>
> Table1 Contents:
> [1,valA1]
> [2,valA2]
> Table2 Contents:
> [1,valB1]
> [1,valB2]
> [2,valB3]
> [2,valB4]
>
> Now let us join the tables on the first column:
>
>     select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol, t2.strcol t2strcol
>     from SparkJoinTable1 t1 JOIN SparkJoinTable2 t2 on t1.intcol = t2.intcol
>
> What results do we get?
>
> came back with 4 results
> Results
> [1,1,valA1,valB2]
> [1,1,valA1,valB2]
> [2,2,valA2,valB4]
> [2,2,valA2,valB4]
>
> Huh?? Where did valB1 and valB3 go? Why do we have duplicate rows?
>
> Note: the expected results were:
>
>     Seq(Seq(1, 1, "valA1", "valB1"),
>         Seq(1, 1, "valA1", "valB2"),
>         Seq(2, 2, "valA2", "valB3"),
>         Seq(2, 2, "valA2", "valB4"))
>
> A standalone testing program, SparkSQLJoinSuite, is attached. An abridged
> version of the actual output is also attached.
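For comparison, the expected inner-join semantics can be checked with plain Scala collections, entirely outside Spark. This is a minimal sketch that reuses the JoinTable2Cols case class from the report; the for-comprehension emits one row per matching (t1, t2) pair, which is what a SQL inner join must produce:

```scala
// Plain-Scala check of inner-join semantics -- no Spark involved.
// Mirrors the case class and the data from the bug report.
case class JoinTable2Cols(intcol: Int, strcol: String)

val table1 = (1 to 2).map(ix => JoinTable2Cols(ix, s"valA$ix"))
val table2 = Seq((1, 1), (1, 2), (2, 3), (2, 4))
  .map { case (ix, is) => JoinTable2Cols(ix, s"valB$is") }

// Inner join on intcol: one output row for every matching pair of rows,
// so valB1 and valB3 must each appear exactly once.
val joined = for {
  t1 <- table1
  t2 <- table2
  if t1.intcol == t2.intcol
} yield (t1.intcol, t2.intcol, t1.strcol, t2.strcol)

joined.foreach(println)
// (1,1,valA1,valB1)
// (1,1,valA1,valB2)
// (2,2,valA2,valB3)
// (2,2,valA2,valB4)
```

These four distinct rows match the expected results in the report, confirming that the duplicated/missing rows in the test output are not a property of join semantics.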
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org