Re: [I] Full join on dataframe with only index yields dropped rows [datafusion-python]

via GitHub Thu, 04 Dec 2025 00:57:08 -0800


IshaGudewar commented on issue #1305:
URL: https://github.com/apache/datafusion-python/issues/1305#issuecomment-3610956298


   Thanks for the detailed report! I investigated the behavior and confirmed 
that the root cause is how the schema is merged during full outer joins.
   
   When one of the input DataFrames contains only the join key (no extra 
columns), the join schema builder drops the duplicated join key column from 
that side. Since that side contributes no other fields, rows that exist only 
on that side are left with no values at all and are removed from the final 
output. As a result, a full join returns fewer rows than it should.
   
   I can contribute by adding:
   
   1. A regression test showing that a full join between
           df(log_time)
       and
           df(log_time, key_frame)
       must return all rows from both sides.
   
   2. A schema validation test verifying that the join key values from the 
“index-only” DataFrame are preserved in the output.
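   
   To pin down what the regression test should expect, here is a minimal 
sketch of full outer join semantics in plain Python. It does not use the 
datafusion API. The helper name `full_outer_join` and the sample values are 
made up for illustration; only the column names `log_time` and `key_frame` 
come from the report.

```python
def full_outer_join(left, right, key):
    """Reference full outer join over lists of dicts (hypothetical helper).

    The join key is emitted once per output row, taken from whichever
    side has the row. Non-key column names are assumed not to collide.
    """
    # Non-key columns each side contributes; an index-only side has none.
    left_cols = sorted({c for row in left for c in row} - {key})
    right_cols = sorted({c for row in right for c in row} - {key})

    # Group right rows by key so matches can be looked up directly.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)

    out, matched = [], set()
    for lrow in left:
        matches = right_by_key.get(lrow[key], [])
        if matches:
            matched.add(lrow[key])
        # An unmatched left row is still emitted, padded with None.
        for rrow in matches or [{}]:
            joined = {key: lrow[key]}
            joined.update({c: lrow.get(c) for c in left_cols})
            joined.update({c: rrow.get(c) for c in right_cols})
            out.append(joined)

    # Unmatched right rows are emitted too, keeping their own key value.
    for rrow in right:
        if rrow[key] not in matched:
            joined = {key: rrow[key]}
            joined.update({c: None for c in left_cols})
            joined.update({c: rrow.get(c) for c in right_cols})
            out.append(joined)
    return out


# Made-up sample data: one index-only side, one side with an extra column.
index_only = [{"log_time": t} for t in (1, 2, 3)]
with_frames = [
    {"log_time": 2, "key_frame": True},
    {"log_time": 4, "key_frame": False},
]

expected = full_outer_join(index_only, with_frames, "log_time")
for row in expected:
    print(row)
```

   The regression test could build the same two inputs as DataFrames, run 
the full join, and compare the result against rows like these. The key 
point is that every `log_time` from either side appears in the output, 
including those found only in the index-only DataFrame.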
   
   If helpful, I can also explore patching the join schema builder to ensure a 
full join retains rows from both sides even when one side has only the join key.
   
   I'll prepare a PR with the regression test shortly.


