[ https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459494#comment-17459494 ]
Nicholas Chammas commented on SPARK-25150:
------------------------------------------

I re-ran my test (described in the issue description and summarized in my comment just above) on Spark 3.2.0, and this issue appears to be resolved! Whether with cross joins enabled or disabled, I now get the correct results. Obviously, I have no clue which change since Spark 2.4.3 (the last time I re-ran this test) was responsible for the fix. But to be clear, in case anyone wants to reproduce my test:
# Download all 6 files attached to this issue into a folder.
# Then, from within that folder, run {{spark-submit zombie-analysis.py}} and inspect the output.
# Then, enable cross joins (commented out on line 9), rerun the script, and reinspect the output.
# Compare the final bit of output from both runs against {{expected-output.txt}}.

> Joining DataFrames derived from the same source yields confusing/incorrect results
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-25150
>                 URL: https://issues.apache.org/jira/browse/SPARK-25150
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1, 2.4.3
>            Reporter: Nicholas Chammas
>            Priority: Major
>              Labels: correctness
>         Attachments: expected-output.txt, output-with-implicit-cross-join.txt, output-without-implicit-cross-join.txt, persons.csv, states.csv, zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these relations, or: enable implicit cartesian products by setting the configuration variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of bug here. The "join condition is missing" error is confusing and doesn't make sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and without the implicit cross join enabled.
> I realize the join I've written is not "correct" in the sense that it should be a left outer join instead of an inner join (since some of the aggregates are not available for all states), but that doesn't explain Spark's behavior.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)