[ 
https://issues.apache.org/jira/browse/SPARK-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Doi updated SPARK-8152:
----------------------------
    Attachment: side-by-side.png

In this screenshot, the dataframes "purchasedRecItems" and "grouped_B" should 
be identical, and the output of show() supports this.  Both are unique on 
itemRecordId and have 9 rows each.

We join each of these with "countUsersDF".  This dataframe is unique on 
itemRecordId, and has 227 rows.

The result of each join should have 9 rows, since "itemRecordId" is unique in 
each dataframe.  However, when using "purchasedRecItems", the result has 
2043 = 227 * 9 rows.  Output from show() reveals that each row of 
"countUsersDF" has been matched with every row of "purchasedRecItems", 
regardless of the itemRecordId join condition.
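
To illustrate why 2043 rows points to the join condition being dropped, here 
is a minimal sketch in plain Python.  It is only a model of an inner join, not 
Spark code: the helper inner_join, the "userCount" field, and all row values 
are hypothetical; only the dataframe names and the row counts (9 and 227) come 
from the report above.

```python
# Plain-Python model of an inner join; this is NOT Spark code.
# The helper and all row values are made up for illustration.

def inner_join(left, right, condition):
    """Return every (a, b) pair for which condition(a, b) holds."""
    return [(a, b) for a in left for b in right if condition(a, b)]

# 9 rows, unique on itemRecordId (stands in for purchasedRecItems / grouped_B).
items = [{"itemRecordId": i} for i in range(9)]

# 227 rows, unique on itemRecordId (stands in for countUsersDF).
count_users = [{"itemRecordId": i, "userCount": 1} for i in range(227)]

# Condition honored: each item matches at most one countUsersDF row.
expected = inner_join(
    items, count_users,
    lambda a, b: a["itemRecordId"] == b["itemRecordId"],
)

# Condition ignored (always true): every pair is kept, i.e. a cross product.
observed = inner_join(items, count_users, lambda a, b: True)

print(len(expected))  # 9
print(len(observed))  # 2043 == 227 * 9
```

In this model, the only way to get 2043 rows from inputs that are unique on 
the key is for the condition to evaluate to true for every pair, which matches 
what show() displays for the "purchasedRecItems" join.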

> Dataframe Join Ignores Condition
> --------------------------------
>
>                 Key: SPARK-8152
>                 URL: https://issues.apache.org/jira/browse/SPARK-8152
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Eric Doi
>         Attachments: side-by-side.png
>
>
> When joining two tables A and B on the condition A.X = B.X, in some cases 
> the result contains rows that do not satisfy that condition.
> I suspect this might be due to duplicate column names in the source tables 
> causing confusion.  Is it possible for hidden fields to exist in a 
> dataframe?
> I will attach a screenshot with more details.  The bug is reproducible but 
> hard to pinpoint.



