[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459494#comment-17459494 ]
Nicholas Chammas commented on SPARK-25150:
------------------------------------------

I re-ran my test (described in the issue description and summarized in my comment just above) on Spark 3.2.0, and this issue appears to be resolved! Whether with cross joins enabled or disabled, I now get the correct results. Obviously, I have no clue which change since Spark 2.4.3 (the last time I re-ran this test) was responsible for the fix. But to be clear, in case anyone wants to reproduce my test:
# Download all 6 files attached to this issue into a folder.
# Then, from within that folder, run {{spark-submit zombie-analysis.py}} and inspect the output.
# Then, enable cross joins (commented out on line 9), rerun the script, and reinspect the output.
# Compare the final bit of output from both runs against {{expected-output.txt}}.

> Joining DataFrames derived from the same source yields confusing/incorrect results
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25150
>                 URL: https://issues.apache.org/jira/browse/SPARK-25150
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.3
>            Reporter: Nicholas Chammas
>            Priority: Major
>              Labels: correctness
>         Attachments: expected-output.txt, output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of bug here. The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should be a left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)