adriangb commented on PR #21641: URL: https://github.com/apache/datafusion/pull/21641#issuecomment-4253861394
> I feel intead of making it non-blocking, there should be a timeout something like [this](https://trino.io/docs/current/admin/dynamic-filtering.html). I agree a timeout could make sense. I don't see that Trino has one (they have other thresholds, some of which we've already copied). IIRC we introduced the barrier for 2 reasons: 1. Correctness. Without synchronization (leader selection) the filters that get pushed down are incorrect. 2. Performance. The hope was that as you say we avoid a situation where we miss the ability to apply filters by a couple ms. This PR keeps (1) but drops (2). Two reasons I think this is a good tradeoff: 1. Performance was never absolute. If the filters end up being not effective, the query is large enough that it doesn't matter or the system is heavily IO constrainted it may make the query slower. See e.g. https://github.com/apache/datafusion/pull/19760, cc @gabotechs. 2. We are working on making scans more adaptive so that they can e.g. start applying a newly generated dynamic filter mid stream. In particular https://github.com/apache/datafusion/pull/21351. I think this mitigates the issue because if a dynamic filter is updated while we are reading the first file on the probe side we don't have to wait a full file to apply it. But yes this is ultimately a tradeoff. Especially because of the deadlocks we are currently experiencing I think trading some performance for correctness is justified. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
