alamb commented on issue #16353: URL: https://github.com/apache/datafusion/issues/16353#issuecomment-2969824040
> As we’ve discussed above the channel receiver is already doing that for us. For some reason file IO was not. I’m not sure I understand why that’s the case and will try to figure out why tomorrow.

This is consistent with our observations at InfluxData: we saw uncancellable queries when feeding our plan from an in-memory cache (not a file / memory)

> https://github.com/pepijnve/datafusion/blob/cancel_spec/dev/design/cancellation.md

This is a really nice writeup: it matches my understanding / mental model. FWIW, it would also make a great start for a post on the DataFusion blog, and I filed a ticket to track that idea 🎣:
- https://github.com/apache/datafusion/issues/16396

> The more I read the more I can see DataFusion is basically a modern day Volcano.

I think this is an accurate assessment, though I would probably phrase it as "DataFusion uses Volcano-style parallelism where operators are single threaded and Exchange (`RepartitionExec`) operators handle parallelism". The other prevalent style is called "Morsel Driven Parallelism", popularized by DuckDB and TUM/Umbra [in this paper](https://db.in.tum.de/~leis/papers/morsels.pdf), which uses operators that are explicitly multi-threaded.

> Each of the colored blocks is an independently executing sub portion of the query. Translated to Tokio each of these colored blocks is a separate concurrent task. Each of those tasks needs to be cooperatively scheduled to guarantee all of them get a fair share of time to run.

This is true in theory -- but I think we also take pains to avoid "over scheduling" tasks in tokio. For example, we purposely only have `N` input partitions (and hence `N` streams) per scan, even if there are 100+ files. The goal is to keep all the cores busy, but not oversubscribed.

> So what does all this mean in terms of implementation:

This also sounds fine to me, and I would be happy to review PRs, etc. However, it is not 100% clear whether your proposed design

1. fixes any bugs / adds features over the current one, or
2. is "just" a cleaner way to implement the same thing (this is also a fine thing to contribute).

For example, I wonder if there are additional tests / cases that would be improved with the proposed implementation 🤔

> The one thing that we still cannot solve automatically then is dynamic query planning. Operators that create streams dynamically still have to make sure they set things up correctly themselves.

In my opinion this is fine -- if operators are making dynamic streams, that is an advanced use case that today must still handle canceling / yielding. I think it is ok if we can't find a way to automatically provide yielding behavior to them (they are no worse off than today).

> One possible downside to this approach is that the cooperative scheduling budget is implementation specific to the Tokio runtime. DataFusion becomes more tied to Tokio rather than less. Not sure if that's an issue or not.

I personally don't think this is an issue, as I don't see any movement and have not heard any desire to move away from tokio.
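To make the yielding idea in this thread concrete, here is a minimal sketch of a per-poll budget, written against the Rust standard library only. It is not DataFusion's implementation and does not use Tokio's cooperative scheduling API. The names `BudgetedCount`, `CountingWaker`, and `run_counting_yields` are hypothetical and exist only for illustration. The future stands in for an operator loop that always has input available, so without a budget it would never return `Pending` and could not be cancelled between polls.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for an operator whose input is always ready.
struct BudgetedCount {
    remaining: u64, // items still to process
    produced: u64,  // items processed so far
    budget: u32,    // max items processed per call to `poll`
}

impl Future for BudgetedCount {
    type Output = u64;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        let mut left = self.budget;
        while self.remaining > 0 {
            if left == 0 {
                // Budget exhausted: request another poll, then yield so the
                // executor can run other tasks or drop (cancel) this one.
                cx.waker().wake_by_ref();
                return Poll::Pending;
            }
            self.remaining -= 1;
            self.produced += 1;
            left -= 1;
        }
        Poll::Ready(self.produced)
    }
}

/// Waker that only counts how often it was woken (one wake per yield here).
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Toy single-future executor: polls to completion and reports the number
/// of times the future yielded. A real runtime would interleave other tasks
/// between polls instead of re-polling immediately.
fn run_counting_yields<F: Future + Unpin>(mut fut: F) -> (F::Output, usize) {
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = Pin::new(&mut fut).poll(&mut cx) {
            return (value, counter.0.load(Ordering::SeqCst));
        }
    }
}
```

Calling `run_counting_yields(BudgetedCount { remaining: 10, produced: 0, budget: 4 })` processes all 10 items but returns control to the executor twice along the way. Those return points are where a runtime could schedule another task or observe that the query was cancelled.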