[ https://issues.apache.org/jira/browse/SPARK-17211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435490#comment-15435490 ]
Himanish Kushary commented on SPARK-17211:
------------------------------------------
[~dongjoon] [~jseppanen] I am also seeing this issue on EMR, using release
emr-5.0.0 with Amazon Hadoop 2.7.2 and Spark 2.0.0.
For a simple join, {{df1.join(df2, "id")}}, the broadcast causes the fields from
df2 to either come back as "null" or be assigned an incorrect value. With the
broadcast disabled, the join works as expected.
Everything works fine locally in test cases. Could it have something to do
with the EMR environment?
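For reference, one way to disable the broadcast for a plain {{df1.join(df2, "id")}} is to turn off automatic broadcast joins for the session. The sketch below assumes {{spark}} is an existing SparkSession; the helper name {{disable_auto_broadcast}} is made up for illustration. It only affects joins the planner would broadcast on its own, not joins that carry an explicit {{broadcast()}} hint.

```python
# Minimal sketch, not a fix for the underlying bug.
# Assumption: `spark` is an existing SparkSession (Spark 2.0+).
# The helper name is hypothetical; the config key is a standard Spark SQL setting.

AUTO_BROADCAST_KEY = "spark.sql.autoBroadcastJoinThreshold"


def disable_auto_broadcast(spark):
    """Stop the planner from choosing broadcast joins automatically.

    A threshold of -1 disables size-based broadcast selection for this session.
    """
    spark.conf.set(AUTO_BROADCAST_KEY, "-1")
```

After calling {{disable_auto_broadcast(spark)}}, {{df1.join(df2, "id").explain()}} can be used to check which join strategy the physical plan actually uses.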
> Broadcast join produces incorrect results
> -----------------------------------------
>
> Key: SPARK-17211
> URL: https://issues.apache.org/jira/browse/SPARK-17211
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Reporter: Jarno Seppanen
>
> Broadcast join produces incorrect columns in join result, see below for an
> example. The same join but without using broadcast gives the correct columns.
> Running PySpark on YARN on Amazon EMR 5.0.0.
> {noformat}
> import pyspark.sql.functions as func
> keys = [
>     (54000000, 0),
>     (54000001, 1),
>     (54000002, 2),
> ]
> keys_df = spark.createDataFrame(keys, ['key_id', 'value']).coalesce(1)
> keys_df.show()
> # +--------+-----+
> # |  key_id|value|
> # +--------+-----+
> # |54000000|    0|
> # |54000001|    1|
> # |54000002|    2|
> # +--------+-----+
> data = [
>     (54000002, 1),
>     (54000000, 2),
>     (54000001, 3),
> ]
> data_df = spark.createDataFrame(data, ['key_id', 'foo'])
> data_df.show()
> # +--------+---+
> # |  key_id|foo|
> # +--------+---+
> # |54000002|  1|
> # |54000000|  2|
> # |54000001|  3|
> # +--------+---+
> ### INCORRECT ###
> data_df.join(func.broadcast(keys_df), 'key_id').show()
> # +--------+---+--------+
> # |  key_id|foo|   value|
> # +--------+---+--------+
> # |54000002|  1|54000002|
> # |54000000|  2|54000000|
> # |54000001|  3|54000001|
> # +--------+---+--------+
> ### CORRECT ###
> data_df.join(keys_df, 'key_id').show()
> # +--------+---+-----+
> # | key_id|foo|value|
> # +--------+---+-----+
> # |54000000|  2|    0|
> # |54000001|  3|    1|
> # |54000002|  1|    2|
> # +--------+---+-----+
> {noformat}
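The reproduction above compares the two join outputs by eye. A small helper can make the same comparison programmatically, which may be handy when checking whether a given cluster is affected. This is only a sketch: the function names {{same_rows}} and {{broadcast_join_matches}} are hypothetical, and it assumes both DataFrames are small enough to {{collect()}} to the driver.

```python
from collections import Counter


def same_rows(left_rows, right_rows):
    """Compare two collected results as multisets of tuples, ignoring row order."""
    return Counter(map(tuple, left_rows)) == Counter(map(tuple, right_rows))


def broadcast_join_matches(data_df, keys_df, key="key_id"):
    """Return True if the hinted broadcast join and the plain join agree.

    Assumption: both inputs are PySpark DataFrames sharing the column `key`.
    """
    # Imported lazily so same_rows() can be used without a Spark installation.
    from pyspark.sql.functions import broadcast

    hinted = data_df.join(broadcast(keys_df), key).collect()
    plain = data_df.join(keys_df, key).collect()
    return same_rows(hinted, plain)
```

With the data from the report, {{broadcast_join_matches(data_df, keys_df)}} would be expected to return False on an affected setup, since the broadcast result repeats key_id in the value column.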