nuno-faria commented on PR #17197:
URL: https://github.com/apache/datafusion/pull/17197#issuecomment-3199697422

   > @nuno-faria I was able to confirm this is just a race between updating the 
filter and starting work on the probe side: 
https://github.com/pydantic/datafusion/compare/fix-hash-join-partitioned...pydantic:datafusion:demo-race?expand=1
   
   I see, that makes sense. I think in theory we would need to have something 
like what Postgres does and determine at plan time that following the 
parameterized path would be the best approach, which would be quite complex.
   
   > But I don't think that is a good approach long term. The code is more 
complex and there are potentials for deadlocks. The way the current code is 
structured even if there are bugs it should never be slower than not having 
dynamic filters. They just may take a couple batches to kick in, they won't 
help small queries on a local SSD (like we've been testing here) much but will 
help massively for large queries on slower storage, etc.
   
   Agreed.
   
   I did some tests and found that the number of rows is kept to the minimum 
when the number of partitions is set to 1. On a simple join query this makes it 
more than 20x faster than DuckDB.
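
To make the "couple of batches to kick in" behavior concrete, here is a toy model of the probe-side scan. It is only a sketch: `DynamicFilter`, `probe_scan`, and `ready_after` are hypothetical names, not DataFusion APIs, and the assumption that a single partition lets the filter be ready before the first probe batch is just one plausible reading of the test result above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Bounds = Tuple[int, int]


@dataclass
class DynamicFilter:
    """Toy stand-in for a dynamic filter (hypothetical, not DataFusion's type)."""
    bounds: Optional[Bounds] = None

    def update(self, bounds: Bounds) -> None:
        # Called once the build side knows its key range.
        self.bounds = bounds

    def keep(self, value: int) -> bool:
        if self.bounds is None:
            # Filter not published yet: behave exactly like having no filter.
            return True
        lo, hi = self.bounds
        return lo <= value <= hi


def probe_scan(batches: List[List[int]], build_bounds: Bounds,
               ready_after: Optional[int]) -> int:
    """Count probe rows that get past the scan.

    ready_after is an assumed knob: the number of batches already scanned
    before the filter is published. None means it is never published.
    """
    filt = DynamicFilter()
    emitted = 0
    for i, batch in enumerate(batches):
        if ready_after is not None and i >= ready_after:
            filt.update(build_bounds)
        emitted += sum(1 for v in batch if filt.keep(v))
    return emitted


# Four probe batches of 100 rows each; build-side keys fall in [0, 9].
batches = [list(range(100)) for _ in range(4)]
build_bounds = (0, 9)

no_filter = probe_scan(batches, build_bounds, ready_after=None)  # no dynamic filter
late = probe_scan(batches, build_bounds, ready_after=2)          # filter loses the race
ready = probe_scan(batches, build_bounds, ready_after=0)         # filter ready up front

print(no_filter, late, ready)
```

In this model a late filter still never emits more rows than having no filter at all, which matches the "should never be slower" argument; it simply prunes less than a filter that is ready before the probe side starts.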
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


