Re: [PR] Fix HashJoinExec sideways information passing for partitioned queries [datafusion]

via GitHub Tue, 19 Aug 2025 01:07:45 -0700


nuno-faria commented on PR #17197:
URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3199697422

> @nuno-faria I was able to confirm this is just a race between updating the
filter and starting work on the probe side:
https://github.com/pydantic/datafusion/compare/fix-hash-join-partitioned...pydantic:datafusion:demo-race?expand=1

I see, that makes sense. I think in theory we would need to have something
like what Postgres does and determine at plan time that following the
parameterized path would be the best approach, which would be quite complex.

> But I don't think that is a good approach long term. The code is more
complex and there are potentials for deadlocks. The way the current code is
structured even if there are bugs it should never be slower than not having
dynamic filters. They just may take a couple batches to kick in, they won't
help small queries on a local SSD (like we've been testing here) much but will
help massively for large queries on slower storage, etc.

Agreed.
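
A minimal sketch of the trade-off in the quoted paragraph, assuming a min/max
style filter on the join key. This is not DataFusion's implementation: `Bounds`,
`rows_reaching_join` and `ready_after` are hypothetical names, and the race is
modelled only as "the filter becomes visible after N probe batches". A late
filter lets more probe rows through, but never more than having no filter.

```rust
/// Hypothetical dynamic filter: keeps probe rows whose key lies within the
/// min/max bounds observed on the build side.
#[derive(Clone, Copy)]
struct Bounds {
    min: i64,
    max: i64,
}

/// Counts the probe rows that reach the join when the filter only becomes
/// visible after `ready_after` batches have already been scanned.
fn rows_reaching_join(batches: &[Vec<i64>], bounds: Bounds, ready_after: usize) -> usize {
    batches
        .iter()
        .enumerate()
        .map(|(i, batch)| {
            if i < ready_after {
                // Filter not published yet: every row in the batch passes.
                batch.len()
            } else {
                // Filter published: only keys inside the bounds pass.
                batch
                    .iter()
                    .filter(|&&k| k >= bounds.min && k <= bounds.max)
                    .count()
            }
        })
        .sum()
}

fn main() {
    // Four probe batches, each holding the keys 0..100 (assumed data).
    let batches: Vec<Vec<i64>> = (0..4).map(|_| (0..100).collect()).collect();
    // Assume the build side only contains keys 0..=9.
    let bounds = Bounds { min: 0, max: 9 };

    let immediate = rows_reaching_join(&batches, bounds, 0);
    let late = rows_reaching_join(&batches, bounds, 2);
    let never = rows_reaching_join(&batches, bounds, batches.len());

    println!("filter ready before probing: {immediate} rows");
    println!("filter ready after 2 batches: {late} rows");
    println!("filter never applied:        {never} rows");
}
```

With this made-up data, the filter that arrives two batches late still removes
rows from the remaining batches, and the worst case equals the unfiltered scan.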

I did some tests and found that the number of rows is kept to the minimum
when the number of partitions is set to 1. On a simple join query this makes it
more than 20x faster than DuckDB.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

