[
https://issues.apache.org/jira/browse/SPARK-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Armbrust updated SPARK-4775:
------------------------------------
Description:
I am working on testing HBase joins. As part of this work, some simple vanilla
Spark SQL tests were created. Some of the results are surprising; here are the
details:
------------------------------------
Consider the following schema that includes two columns:
{code}
case class JoinTable2Cols(intcol: Int, strcol: String)
{code}
Let us register two temp tables using this schema and insert 2 rows and 4 rows
respectively:
{code}
val rdd1 = sc.parallelize((1 to 2).map { ix => JoinTable2Cols(ix, s"valA$ix") })
rdd1.registerTempTable("SparkJoinTable1")
val ids = Seq((1, 1), (1, 2), (2, 3), (2, 4))
val rdd2 = sc.parallelize(ids.map { case (ix, is) => JoinTable2Cols(ix, s"valB$is") })
rdd2.registerTempTable("SparkJoinTable2")
{code}
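The snippets above use sc and call registerTempTable directly on the case-class RDDs, so they rely on surrounding setup that is not shown here. A minimal sketch of that assumed setup (pre-1.3 style implicit conversion; the exact setup in the attached suite may differ):
{code}
// Not part of the original report: sketch of the setup the snippets assume.
// Uses the Spark 1.1/1.2-style implicit (createSchemaRDD) so that
// registerTempTable can be called on an RDD of case classes directly;
// on 1.3 the equivalent would be import sqlContext.implicits._ plus .toDF().
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("SparkSQLJoinSuite"))
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD
{code}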
Here is the data in both tables:
{code}
Table1 Contents:
[1,valA1]
[2,valA2]
Table2 Contents:
[1,valB1]
[1,valB2]
[2,valB3]
[2,valB4]
{code}
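The dumps above are in the default Row toString format. They can be reproduced by collecting each temp table, roughly as in this sketch (the exact printing code is in the attached SparkSQLJoinSuite):
{code}
// Sketch (not from the original report): dump both temp tables for reference.
println("Table1 Contents:")
sqlContext.sql("SELECT * FROM SparkJoinTable1").collect().foreach(println)
println("Table2 Contents:")
sqlContext.sql("SELECT * FROM SparkJoinTable2").collect().foreach(println)
{code}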
Now let us join the tables on the first column:
{code}
select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol, t2.strcol t2strcol
from SparkJoinTable1 t1
JOIN SparkJoinTable2 t2 on t1.intcol = t2.intcol
{code}
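In the attached suite the query is presumably executed and its output collected along these lines (a sketch; names are illustrative):
{code}
// Sketch (not from the original report): run the join and collect the rows.
val joinSql =
  """select t1.intcol t1intcol, t2.intcol t2intcol, t1.strcol t1strcol, t2.strcol t2strcol
    |from SparkJoinTable1 t1
    |JOIN SparkJoinTable2 t2 on t1.intcol = t2.intcol""".stripMargin
val results = sqlContext.sql(joinSql).collect()
println(s"came back with ${results.length} results")
results.foreach(println)
{code}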
What results do we get? The query came back with 4 results:
{code}
Results
[1,1,valA1,valB2]
[1,1,valA1,valB2]
[2,2,valA2,valB4]
[2,2,valA2,valB4]
{code}
Huh?? Where did valB1 and valB3 go? And why do we have duplicate rows?
Note: the expected results were:
{code}
Seq(
  Seq(1, 1, "valA1", "valB1"),
  Seq(1, 1, "valA1", "valB2"),
  Seq(2, 2, "valA2", "valB3"),
  Seq(2, 2, "valA2", "valB4"))
{code}
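The comparison against those expected rows could look roughly like this (a hypothetical assertion; the actual check is in the attached SparkSQLJoinSuite):
{code}
// Sketch (not from the original report): compare collected rows to the expected values.
val expected = Seq(
  Seq(1, 1, "valA1", "valB1"),
  Seq(1, 1, "valA1", "valB2"),
  Seq(2, 2, "valA2", "valB3"),
  Seq(2, 2, "valA2", "valB4"))
val actual = results.map(_.toSeq).toSeq
assert(actual.map(_.mkString(",")).sorted == expected.map(_.mkString(",")).sorted,
  s"join returned $actual instead of $expected")
{code}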
A standalone test program, SparkSQLJoinSuite, is attached. An abridged version
of the actual output is also attached.
> Possible problem in a simple join? Getting duplicate rows and missing rows
> ---------------------------------------------------------------------------
>
> Key: SPARK-4775
> URL: https://issues.apache.org/jira/browse/SPARK-4775
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Environment: Run on Mac but should be agnostic
> Reporter: Stephen Boesch
> Assignee: Cheng Lian
>