LiaCastaneda opened a new issue, #20492: URL: https://github.com/apache/datafusion/issues/20492
### Is your feature request related to a problem or challenge?

When the `HashJoinExec` build side returns 0 rows, the probe side stream is still fully consumed even though no output will be produced (for join types like INNER, LEFT, LEFT SEMI, etc.). We've seen queries where the probe side scans more than 10 GB of data even though the build side returns no rows and `HashJoinExec` outputs 0 rows.

The short-circuit at [stream.rs:647](https://github.com/apache/datafusion/blob/ace9cd44b7356d60e6d69d0b98ac3f5606d55507/datafusion/physical-plan/src/joins/hash_join/stream.rs#L647) skips the hash lookup work, but `fetch_probe_batch` is still called for every batch until the stream is exhausted. The transition from `WaitBuildSide` to `FetchProbeBatch` is unconditional: there is no check after the build phase completes to decide whether the probe side needs to be polled at all.

I'm not sure if this is intentional or if anything relies on the probe side being fully consumed in this scenario. If not, it seems like after `collect_build_side` completes, we could drop the probe stream immediately for the join types where an empty build side guarantees an empty output.

### Describe the solution you'd like

_No response_

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
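To make the idea concrete, here is a minimal sketch of the kind of check that could run once the build side has been collected. It is not DataFusion code: `JoinKind`, `NextState`, `empty_build_implies_empty_output`, and `state_after_build` are hypothetical names made up for illustration, and the sketch assumes the build side is the left input. The real change would need to hook into the existing state machine in `stream.rs` and confirm that nothing depends on the probe side being drained.

```rust
/// Hypothetical local mirror of the join types discussed in this issue.
/// This is NOT DataFusion's own `JoinType`; it only illustrates the idea.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinKind {
    Inner,
    Left,
    Right,
    Full,
    LeftSemi,
    LeftAnti,
    RightSemi,
    RightAnti,
}

/// Assuming the build side is the left input, returns true when an empty
/// build side means the join cannot emit any rows at all.
pub fn empty_build_implies_empty_output(kind: JoinKind) -> bool {
    matches!(
        kind,
        JoinKind::Inner
            | JoinKind::Left
            | JoinKind::LeftSemi
            | JoinKind::LeftAnti
            | JoinKind::RightSemi
    )
}

/// Hypothetical stand-in for the stream state chosen after the build phase.
#[derive(Debug, PartialEq, Eq)]
pub enum NextState {
    /// Keep polling the probe side (the current unconditional behavior).
    FetchProbeBatch,
    /// Finish immediately without polling the probe side.
    Completed,
}

/// Decides the next state once the build side has been fully collected.
pub fn state_after_build(kind: JoinKind, build_rows: usize) -> NextState {
    if build_rows == 0 && empty_build_implies_empty_output(kind) {
        NextState::Completed
    } else {
        NextState::FetchProbeBatch
    }
}
```

With this shape, `state_after_build(JoinKind::Inner, 0)` would skip the probe side entirely, while RIGHT, FULL, and RIGHT ANTI joins would still read it, since they can emit probe rows even when the build side is empty.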
