alamb opened a new pull request, #14286: URL: https://github.com/apache/datafusion/pull/14286
Note: This PR contains a (substantial) example and supporting code. It has no changes to the core.

## Which issue does this PR close?

- Closes https://github.com/apache/datafusion/issues/12393
- Note this is a new version of https://github.com/apache/datafusion/pull/13424

## Rationale for this change

I have heard from multiple people multiple times that the specifics of using multiple threadpools for separate CPU and IO work in DataFusion are confusing. However, this is a key detail for building high performance engines which process data directly from remote storage, which I think is a very important capability for DataFusion.

My [past attempts](https://github.com/apache/datafusion/pull/13424) to make this example have been bogged down trying to get consensus on details of how to transfer results across streams, the wisdom of wrapping streams, and other details. However, there have been no other actual alternatives proposed (aka no actual code that I could write an example with).

So while the approach of wrapping streams may be a bit ugly, it is working for us in production at InfluxDB and I believe @adriangb said they are using the same strategy at Pydantic.

In my opinion, the community needs **something** that works, even if it is not optimal, which is this example. I would personally like to merge it in so it is easier to find (and not locked in a PR) and iterate afterwards.

## What changes are included in this PR?

1. `thread_pools.rs` example
2. a `dedicated_executor.rs` module that

Note that the DedicatedExecutor code is originally from
1. InfluxDB 3.0 [(source link)](https://github.com/influxdata/influxdb3_core/tree/6fcbb004232738d55655f32f4ad2385523d10696/executor), largely written by @tustvold and @crepererum
3. The `IoObjectStore` wrapper is based on work from @matthewmturne in https://github.com/datafusion-contrib/datafusion-dft/pull/247

Note that I have purposely avoided any changes to the DataFusion crates (such as adding `DedicatedExecutor` to `physical-plan`), but I have written the code with tests with an eye towards doing exactly such a thing once we have some experience / feedback on the approach. This is tracked by
- https://github.com/apache/datafusion/issues/14285

Right now, I think the most important thing is to get this example in for people to try and confirm if it helps them.

## Are these changes tested?

Yes, the example is run as part of CI and there are tests.

TODO: Verify that the tests in the examples run in CI

## Are there any user-facing changes?

Not really
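For readers new to the idea in the rationale, here is a minimal, standard-library-only sketch of keeping CPU-bound work off the thread that handles IO. It is not the `DedicatedExecutor` from this PR and uses no DataFusion or Tokio APIs. `CpuPool`, `Job`, and `sum_of_squares` are hypothetical names invented for illustration, and a single worker thread stands in for a whole pool.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical job type: a boxed closure that can be sent to another thread.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical stand-in for a dedicated CPU pool: one worker thread that
/// runs submitted closures so the calling thread stays free for IO.
struct CpuPool {
    sender: Option<mpsc::Sender<Job>>,
    worker: Option<thread::JoinHandle<()>>,
}

impl CpuPool {
    fn new() -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let worker = thread::Builder::new()
            .name("cpu-worker".to_string())
            .spawn(move || {
                // Loop ends once every sender has been dropped.
                for job in receiver {
                    job();
                }
            })
            .expect("failed to spawn CPU worker thread");
        Self {
            sender: Some(sender),
            worker: Some(worker),
        }
    }

    /// Run `work` on the CPU thread and return a receiver for its result.
    fn spawn<T, F>(&self, work: F) -> mpsc::Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // Ignore the send error if the caller dropped the receiver.
            let _ = tx.send(work());
        });
        self.sender
            .as_ref()
            .expect("pool is shut down")
            .send(job)
            .expect("CPU worker thread exited");
        rx
    }
}

impl Drop for CpuPool {
    fn drop(&mut self) {
        // Dropping the sender closes the channel, which ends the worker loop.
        drop(self.sender.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

/// Placeholder for CPU-heavy work (a real engine would decode or aggregate data).
fn sum_of_squares(values: Vec<u64>) -> u64 {
    values.iter().map(|v| v * v).sum()
}

fn main() {
    let pool = CpuPool::new();
    let caller = thread::current().id();
    // Submit the work, then block here only to fetch the result.
    let result = pool.spawn(|| (sum_of_squares(vec![1, 2, 3]), thread::current().id()));
    let (total, worker) = result.recv().expect("worker dropped the result");
    assert_ne!(caller, worker); // the work ran on the dedicated thread
    println!("sum of squares = {total}");
}
```

In an async engine the same separation would typically be done with two runtimes and the results transferred across them as streams, which is the part the rationale describes as contentious. This sketch only shows the thread separation itself.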