alamb opened a new pull request, #14286:
URL: https://github.com/apache/datafusion/pull/14286

   Note: This PR contains a (substantial) example and supporting code. It has 
no changes to the core.
   
   ## Which issue does this PR close?
   
   - Closes https://github.com/apache/datafusion/issues/12393
   
   - Note this is a new version of 
https://github.com/apache/datafusion/pull/13424
   
   ## Rationale for this change
   
   I have heard from multiple people, multiple times, that the specifics of 
using multiple threadpools for separate CPU and IO work in DataFusion are 
confusing.
   
   However, this is a key detail for building high performance engines which 
process data directly from remote storage, which I think is a very important 
capability for DataFusion.
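
To make the idea of separate pools concrete, here is a minimal, std-only sketch. It is not the `DedicatedExecutor` from this PR and it does not use Tokio or any DataFusion API. `DedicatedThread` and `demo` are hypothetical names invented for illustration, and each "pool" is a single named worker thread. It only shows the shape of the pattern: IO-like work is submitted to one pool, CPU-like work to another, and results cross between them over a channel.

```rust
use std::sync::mpsc;
use std::thread;

/// A job is any closure that can be sent to the worker thread.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical stand-in for a dedicated pool: one named worker thread
/// that runs submitted closures in the order they arrive.
struct DedicatedThread {
    sender: Option<mpsc::Sender<Job>>,
    handle: Option<thread::JoinHandle<()>>,
}

impl DedicatedThread {
    fn new(name: &str) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let handle = thread::Builder::new()
            .name(name.to_string())
            .spawn(move || {
                // Run jobs until every sender has been dropped.
                for job in receiver {
                    job();
                }
            })
            .expect("failed to spawn worker thread");
        Self { sender: Some(sender), handle: Some(handle) }
    }

    /// Submit work; the returned receiver yields the result once it is ready.
    fn spawn<T, F>(&self, work: F) -> mpsc::Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // Ignore the error if the caller dropped the receiver.
            let _ = tx.send(work());
        });
        self.sender
            .as_ref()
            .expect("pool already shut down")
            .send(job)
            .expect("worker thread exited");
        rx
    }
}

impl Drop for DedicatedThread {
    fn drop(&mut self) {
        // Closing the channel ends the worker loop; then wait for it to exit.
        drop(self.sender.take());
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}

/// Simulated "fetch from remote storage" on the IO thread, followed by
/// a "query" (here just a sum) on the CPU thread.
fn demo() -> (u64, String, String) {
    let io = DedicatedThread::new("io");
    let cpu = DedicatedThread::new("cpu");

    let fetched = io.spawn(|| {
        let name = thread::current().name().unwrap_or("?").to_string();
        (vec![1u64, 2, 3, 4], name)
    });
    let (data, io_thread) = fetched.recv().expect("io job failed");

    let computed = cpu.spawn(move || {
        let name = thread::current().name().unwrap_or("?").to_string();
        (data.iter().sum::<u64>(), name)
    });
    let (sum, cpu_thread) = computed.recv().expect("cpu job failed");
    (sum, io_thread, cpu_thread)
}
```

Calling `demo()` returns the sum together with the names of the threads that ran each step, so you can check that the two kinds of work really ran on different threads. A real engine would use async runtimes rather than blocking `recv` calls; the blocking version is used here only to keep the sketch free of dependencies.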
   
   My [past attempts](https://github.com/apache/datafusion/pull/13424) to make 
this example have been bogged down trying to get consensus on details of how to 
transfer results across streams, the wisdom of wrapping streams, and other 
details. 
   
   However, no concrete alternatives have been proposed (that is, no actual 
code that I could write an example with). So while the approach of wrapping 
streams may be a bit ugly, it is working for us in production at InfluxDB, and 
I believe @adriangb said they are using the same strategy at Pydantic. 
   
   In my opinion, the community needs **something** that works, even if it is 
not optimal, which is this example.
   
   I would personally like to merge it in so it is easier to find (and not 
locked in a PR) and iterate afterwards.
   
   ## What changes are included in this PR?
   
   1. `thread_pools.rs` example
   2. a `dedicated_executor.rs` module that 
   
   Note that the DedicatedExecutor code is originally from 
   1. InfluxDB 3.0 [(source 
link)](https://github.com/influxdata/influxdb3_core/tree/6fcbb004232738d55655f32f4ad2385523d10696/executor),
 largely written by @tustvold and @crepererum 
   2. The `IoObjectStore` wrapper is based on work from @matthewmturne in 
https://github.com/datafusion-contrib/datafusion-dft/pull/247 
   
   Note that I have purposely avoided any changes to the DataFusion crates 
(such as adding `DedicatedExecutor` to `physical-plan`), but I have written the 
code with tests with an eye towards doing exactly such a thing once we have 
some experience / feedback on the approach. This is tracked by 
   - https://github.com/apache/datafusion/issues/14285
   
   Right now, I think the most important thing is to get this example in for 
people to try and confirm whether it helps them.
   
   
   
   ## Are these changes tested?
   
   Yes, the example is run as part of CI and there are tests.
   
   TODO: Verify that the tests in the examples run in CI
   
   ## Are there any user-facing changes?
   
   Not really


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

