[
https://issues.apache.org/jira/browse/SPARK-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Doi updated SPARK-8152:
----------------------------
Attachment: side-by-side.png
In the attached screenshot, the dataframes "purchasedRecItems" and "grouped_B"
should be identical, and the output of show() supports this. Each is unique on
itemRecordId and has 9 rows.
We join each of these with "countUsersDF". This dataframe is unique on
itemRecordId, and has 227 rows.
The result of each join should have 9 rows, since "itemRecordId" is unique in
both inputs. However, when using "purchasedRecItems", the result has
2043 = 227 * 9 rows. Output from show() shows that every row of "countUsersDF"
has been matched with every row of "purchasedRecItems", regardless of the
itemRecordId join condition.
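
For illustration only, here is a toy model of the row counts in plain Python.
It does not use Spark, and the data is synthetic; only the sizes 9 and 227 come
from the report above. The helper nested_loop_join and the field userCount are
hypothetical names made up for this sketch. It assumes, without confirmation,
that the join condition somehow ends up comparing a column with itself, which
would make it always true and turn the join into a cross product.

```python
# Toy model in plain Python (no Spark): it reproduces row counts only.
# Synthetic data; only the sizes 9 and 227 are taken from the report.


def nested_loop_join(left, right, condition):
    """Return every (l, r) pair for which condition(l, r) is true."""
    return [(l, r) for l in left for r in right if condition(l, r)]


# 227 rows, unique on itemRecordId (stand-in for countUsersDF).
count_users = [{"itemRecordId": i, "userCount": 0} for i in range(227)]
# 9 rows, unique on itemRecordId (stand-in for purchasedRecItems).
purchased = [{"itemRecordId": i} for i in range(9)]

# Intended condition: compare the key taken from each side.
expected = nested_loop_join(
    purchased, count_users,
    lambda l, r: l["itemRecordId"] == r["itemRecordId"],
)

# Assumed failure mode: both operands refer to the same side,
# so the condition is true for every pair of rows.
degenerate = nested_loop_join(
    purchased, count_users,
    lambda l, r: l["itemRecordId"] == l["itemRecordId"],
)

print(len(expected), len(degenerate))
```

The first count corresponds to the result expected from the join, and the
second to the 227 * 9 rows observed with "purchasedRecItems".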
> Dataframe Join Ignores Condition
> --------------------------------
>
> Key: SPARK-8152
> URL: https://issues.apache.org/jira/browse/SPARK-8152
> Project: Spark
> Issue Type: Bug
> Reporter: Eric Doi
> Attachments: side-by-side.png
>
>
> When joining two tables A and B on the condition A.X = B.X, in some cases
> the condition does not hold for rows in the result.
> I suspect this is caused by duplicate column names in the source tables.
> Is it possible for a dataframe to contain hidden fields?
> I will attach a screenshot with more details. The bug is reproducible but
> hard to pinpoint.