[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813511#comment-16813511 ]
Brandon Perry commented on SPARK-25150:
---------------------------------------

[~srowen], I ran into this situation yesterday as well, and I think there may be some miscommunication here about expected versus actual behavior. Many people are accustomed to writing joins sequentially in SQL; using the sample scenario here:

{code:SQL|borderstyle=solid}
SELECT
    a.State,
    a.`Total Population`,
    b.count AS `Total Humans`,
    c.count AS `Total Zombies`
FROM states AS a
JOIN total_humans AS b ON a.State = b.State
JOIN total_zombies AS c ON a.State = c.State
ORDER BY a.State ASC;
{code}

On virtually all ANSI SQL systems, this produces the output which [~nchammas] mentions is expected. However, it looks like Spark actually evaluates the chained joins by doing something like (states JOIN humans ON state) JOIN (states JOIN zombies ON state) ON (_no condition specified_). Part of the problem is that even when you attempt to fix the states['State'] join, you still get the "trivially inferred" warning with inappropriate output, because the DataFrames share the same lineage and Spark optimizes past the intended logic:

{code:Python|borderstyle=solid}
states_with_humans = states \
    .join(
        total_humans,
        on=(states['State'] == total_humans['State'])
    )

analysis = states_with_humans \
    .join(
        total_zombies,
        on=(states_with_humans['State'] == total_zombies['State'])
    ) \
    .orderBy(states['State'], ascending=True) \
    .select(
        states_with_humans['State'],
        states_with_humans['Total Population'],
        states_with_humans['count'].alias('Total Humans'),
        total_zombies['count'].alias('Total Zombies'),
    )
{code}

Is there something we're all missing here? This seems to be a textbook example of a three-way join not functioning as expected without explicit aliasing. Is there a reason this behavior is desirable?
> Joining DataFrames derived from the same source yields confusing/incorrect results
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25150
>                 URL: https://issues.apache.org/jira/browse/SPARK-25150
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Chammas
>            Priority: Major
>         Attachments: expected-output.txt, output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, persons.csv, states.csv, zombie-analysis.py
>
> I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of bug here. The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should be a left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)