Ben Moran created SPARK-10914:
---------------------------------
Summary: Incorrect empty join sets
Key: SPARK-10914
URL: https://issues.apache.org/jira/browse/SPARK-10914
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.1, 1.5.0
Environment: Ubuntu 14.04 (spark-slave), 12.04 (master)
Reporter: Ben Moran
Using an inner join to match two integer columns, I generally get no results
when there should be matches. The results vary, however, depending on whether
the DataFrames come from SQL, from JSON, or from the cache, and on the order
in which I cache and query them.
This minimal example reproduces it consistently for me in the spark-shell, on
fresh installs of both 1.5.0 and 1.5.1 (pre-built for Hadoop 2.6, from
http://spark.apache.org/downloads.html).
/* x is {"xx":1}{"xx":2} and y is {"yy":1}{"yy":2} */
val x = sql("select 1 xx union all select 2")
val y = sql("select 1 yy union all select 2")
x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
/* If I cache both tables it works: */
x.cache()
y.cache()
x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */
/* but this still doesn't work: */
x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */
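A possible next diagnostic step, given that the results depend on where the
DataFrames come from, is to build the same two rows from local collections and
compare the query plans of the two joins. This is only a sketch and has not
been run as part of this report: it assumes the Spark 1.5.x spark-shell, where
sqlContext is predefined, and the names x2 and y2 are hypothetical. It prints
plans for inspection and does not claim which variant returns the right count.

```scala
/* Assumes the Spark 1.5.x spark-shell, where sqlContext is predefined. */
import sqlContext.implicits._

/* The frames from the reproduction above, built with SQL. */
val x = sqlContext.sql("select 1 xx union all select 2")
val y = sqlContext.sql("select 1 yy union all select 2")

/* Hypothetical variants holding the same rows, built from local tuples. */
val x2 = Seq(Tuple1(1), Tuple1(2)).toDF("xx")
val y2 = Seq(Tuple1(1), Tuple1(2)).toDF("yy")

/* Print the logical and physical plans so the two joins can be compared. */
x.join(y, $"xx" === $"yy").explain(true)
x2.join(y2, $"xx" === $"yy").explain(true)

/* Counts for the two variants, to be compared with the expected value of 2. */
println(x.join(y, $"xx" === $"yy").count())
println(x2.join(y2, $"xx" === $"yy").count())
```

Comparing the two physical plans may show whether the join strategy or the
column types differ between the SQL-built and the locally built frames.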