[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

Nicholas Chammas (JIRA) Fri, 28 Sep 2018 12:31:24 -0700


    [ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632281#comment-16632281 ]


Nicholas Chammas commented on SPARK-25150:
------------------------------------------

I've uploaded the expected output.

I realize that the reproduction I've attached to this ticket 
(zombie-analysis.py plus the related files), though complete and 
self-contained, is a bit verbose. If it's not helpful enough I will see if I 
can boil it down further.

[~petertoth] - I suggest you take another look at the output with cross joins 
enabled and compare it to what (I think) is the correct expected output. If I'm 
understanding things correctly, there are two issues: 1) the bad error when 
cross join is not enabled (there should be no error), and 2) the incorrect 
results when cross join _is_ enabled (the results I just uploaded).

Your PR doesn't appear to investigate or address the incorrect results issue, 
so I'm not sure whether it would fix that too, or whether I am just mistaken 
about there being a second issue.

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25150
>                 URL: https://issues.apache.org/jira/browse/SPARK-25150
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Nicholas Chammas
>            Priority: Major
>         Attachments: expected-output.txt, 
> output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, 
> persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should 
> be a left outer join instead of an inner join (since some of the aggregates 
> are not available for all states), but that doesn't explain Spark's behavior.
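
The inner versus left outer join distinction mentioned in the description can be illustrated with a minimal plain-Python sketch. It is not taken from zombie-analysis.py and does not use the Spark API; the state codes, counts, and the names `b1`, `b2`, `inner_join`, and `left_outer_join` are hypothetical, chosen only to show what happens when an aggregate is missing for one state.

```python
# Hypothetical data: not taken from persons.csv or states.csv.
states = ["CA", "NY", "TX"]
b1 = {"CA": 10, "NY": 7, "TX": 4}  # first aggregate, present for every state
b2 = {"CA": 3, "TX": 1}            # second aggregate, missing for NY


def inner_join(keys, left, right):
    """Keep only keys that have a value on both sides."""
    return [(k, left[k], right[k]) for k in keys if k in left and k in right]


def left_outer_join(keys, left, right):
    """Keep every key found on the left side; use None where the right side has no value."""
    return [(k, left[k], right.get(k)) for k in keys if k in left]


inner_rows = inner_join(states, b1, b2)       # NY is dropped
outer_rows = left_outer_join(states, b1, b2)  # NY is kept with None for b2

print(inner_rows)
print(outer_rows)
```

With an inner join, a state that lacks one aggregate disappears from the result, while a left outer join keeps it with a null. That difference affects which rows appear, but as the description notes, it does not account for the "join condition is missing" error or the unexpected values.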



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

