kosiew commented on issue #1305: URL: https://github.com/apache/datafusion-python/issues/1305#issuecomment-3702139803
The current Python join wrapper drops mutually named keys by default for all join types, including full, by projecting away one side’s qualified key column when drop_duplicate_keys is true.
https://github.com/apache/datafusion-python/blob/fcd70567dedc580416c2931cc7f25e3960704ace/src/dataframe.rs#L650-L715

The user guide also documents this default key-dropping behavior without noting any join-type exceptions.
https://github.com/apache/datafusion-python/blob/fcd70567dedc580416c2931cc7f25e3960704ace/docs/source/user-guide/common-operations/joins.rst#L109-L136

Given that, I agree with @renato’s concern: for full outer joins the two key columns are not equivalent, so dropping one of them can remove the only way to represent unmatched rows. Disallowing drop_duplicate_keys=True (or forcing it to False) for how="full" would avoid silent data loss and keep both key columns available for user-controlled coalescing or renaming afterward, aligning with SQL-style expectations.

@mesejo’s coalesce-based approach also addresses the correctness gap and matches behaviors in other libraries. If we keep drop_duplicate_keys=True as the default, applying coalesce for outer joins instead of dropping one side would preserve rows while still returning a single key column; the trade-off is slightly higher compute and less user control over per-column treatment.

Either way, documenting the chosen semantics explicitly for full joins will help set user expectations.
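
To make the difference concrete, here is a minimal sketch in plain Python. It does not call the DataFusion API; `full_outer_join` and the column names `l_id`, `r_id`, `l_val`, `r_val` are hypothetical and only model the output shapes discussed above. It assumes keys are unique on each side.

```python
# Plain-Python model of a full outer join on a shared key column "id".
# Hypothetical helper for illustration only; this is not DataFusion code.

def full_outer_join(left, right, key):
    """Join two lists of dicts, keeping both sides' key columns."""
    right_by_key = {r[key]: r for r in right}  # assumes unique right keys
    left_keys = {l[key] for l in left}
    rows = []
    # Every left row, matched with a right row when one exists.
    for l in left:
        r = right_by_key.get(l[key])
        rows.append({
            "l_id": l[key],
            "r_id": r[key] if r is not None else None,
            "l_val": l["val"],
            "r_val": r["val"] if r is not None else None,
        })
    # Right rows that have no match on the left.
    for r in right:
        if r[key] not in left_keys:
            rows.append({"l_id": None, "r_id": r[key],
                         "l_val": None, "r_val": r["val"]})
    return rows


left = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
right = [{"id": 2, "val": "x"}, {"id": 3, "val": "y"}]

joined = full_outer_join(left, right, "id")

# Option A: drop the right-side key. The right-only row keeps no key at all.
dropped = [
    {"id": row["l_id"], "l_val": row["l_val"], "r_val": row["r_val"]}
    for row in joined
]

# Option B: coalesce the two keys into one column. Every row keeps its key.
coalesced = [
    {
        "id": row["l_id"] if row["l_id"] is not None else row["r_id"],
        "l_val": row["l_val"],
        "r_val": row["r_val"],
    }
    for row in joined
]

for row in dropped:
    print("dropped  ", row)
for row in coalesced:
    print("coalesced", row)
```

In option A the row that exists only on the right comes back with a null key, which is the silent data loss described above. Option B returns a single key column without losing it, at the cost of hiding which side each key came from.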
