timsaucer opened a new issue, #16821:
URL: https://github.com/apache/datafusion/issues/16821

   ### Is your feature request related to a problem or challenge?
   
   While building table providers for a custom data source, I ran 
into cases where I could gain significant performance boosts by taking 
advantage of known data partitioning and ordering. I would like to put together 
some useful examples, built on a well-known source, that users can follow along with.
   
   Specifically, I found it helpful to explain the different layers that a user 
might either implement themselves or cover with one of the generic implementations:
   
   - Table Provider
   - Execution Plan
   - Sendable Record Batch Stream
   
   Also helpful would be an explanation of *where* the work should occur. I've 
seen multiple initial approaches where users are doing a significant amount of 
async work in the table provider or adding some blocking calls to async work in 
the execution plan. There is some documentation already as to where this should 
be done, but if you are not familiar with all of the internals it may not be 
obvious where to add your implementation.
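
   To make the split between the layers concrete, here is a minimal sketch in 
plain Rust. It deliberately does not use DataFusion's traits: `RangeTable`, 
`RangePlan`, `RangeStream` and `Batch` are hypothetical stand-ins, everything is 
synchronous, and the real signatures differ (in DataFusion, 
`TableProvider::scan` is async and `ExecutionPlan::execute` returns a 
`SendableRecordBatchStream`). The only point is the shape: the provider plans, 
the plan describes partitions and cheaply hands out streams, and the stream is 
where rows are actually produced.

```rust
// Simplified stand-ins for the three layers. These are NOT DataFusion's
// traits; every name below is hypothetical.

/// Stand-in for a record batch: just a vector of row values.
type Batch = Vec<i64>;

/// Layer 3, the stream: the only place where data is actually produced.
struct RangeStream {
    next: i64,
    end: i64,
    batch_size: i64,
}

impl Iterator for RangeStream {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        if self.next >= self.end {
            return None;
        }
        // The (potentially expensive) work happens lazily, one batch per poll.
        let stop = (self.next + self.batch_size).min(self.end);
        let batch: Batch = (self.next..stop).collect();
        self.next = stop;
        Some(batch)
    }
}

/// Layer 2, the plan: describes the partitions and hands out streams,
/// but performs no I/O or heavy computation itself.
struct RangePlan {
    partitions: Vec<(i64, i64)>,
    batch_size: i64,
}

impl RangePlan {
    fn partition_count(&self) -> usize {
        self.partitions.len()
    }

    /// Cheap: only constructs the stream for one partition.
    fn execute(&self, partition: usize) -> RangeStream {
        let (start, end) = self.partitions[partition];
        RangeStream { next: start, end, batch_size: self.batch_size }
    }
}

/// Layer 1, the provider: knows the table and turns a scan request into a plan.
struct RangeTable {
    rows: i64,
}

impl RangeTable {
    fn scan(&self, target_partitions: usize, batch_size: i64) -> RangePlan {
        let n = target_partitions.max(1) as i64;
        // Split 0..rows into n contiguous chunks (rounding the chunk size up).
        let chunk = (self.rows + n - 1) / n;
        let partitions = (0..n)
            .map(|p| ((p * chunk).min(self.rows), ((p + 1) * chunk).min(self.rows)))
            .collect();
        RangePlan { partitions, batch_size: batch_size.max(1) }
    }
}
```

   Calling `RangeTable { rows: 10 }.scan(3, 3)` yields a plan with three 
partitions, and nothing is generated until a stream returned by `execute` is 
iterated.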
   
   ### Describe the solution you'd like
   
   What I would most like from this issue is ideas about what would be good 
to include. Some of the things I've thought about are:
   
   - Demonstrating how to do CPU-intensive work on a separate thread pool, and 
how to manage shutdown and the passing of data around
   - Seeing if there are any specific queries that can be more performant if we 
know a priori how the data coming out of the generator is organized. 
Alternatively, adding a different data source just to demonstrate how improved 
sorting / partitioning information can be used.
   - Adding an example showing how to compute the partition hashes for some 
number of `target_partitions`
   - Looking for examples where improved lex ordering can be taken advantage of (I 
haven't thought deeply about this)
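
   As a rough illustration of the first and third ideas, the sketch below 
hashes keys on a dedicated worker thread, assigns each key to one of 
`target_partitions` buckets, and shuts the worker down by dropping the sending 
side of a channel. It uses only the Rust standard library. The helper names 
`partition_for` and `partition_on_worker` are hypothetical, and `DefaultHasher` 
is a stand-in: it is not the hash DataFusion uses, so a real table provider that 
declares hash partitioning would need to match DataFusion's own assignment, 
which this sketch does not attempt.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::mpsc;
use std::thread;

/// Map a key to a partition index in `0..target_partitions`.
/// Assumption: `DefaultHasher` is only a stand-in for the engine's hash.
fn partition_for<K: Hash>(key: &K, target_partitions: usize) -> usize {
    assert!(target_partitions > 0, "need at least one partition");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % target_partitions as u64) as usize
}

/// Hash batches of keys on a worker thread and return, for each partition,
/// the keys assigned to it (in arrival order).
fn partition_on_worker(
    batches: Vec<Vec<String>>,
    target_partitions: usize,
) -> Vec<Vec<String>> {
    assert!(target_partitions > 0, "need at least one partition");
    let (input_tx, input_rx) = mpsc::channel::<Vec<String>>();
    let (output_tx, output_rx) = mpsc::channel::<(usize, String)>();

    let worker = thread::spawn(move || {
        // This loop ends once every sender for `input_rx` has been dropped.
        for batch in input_rx {
            for key in batch {
                let partition = partition_for(&key, target_partitions);
                // If the consumer has gone away, stop early.
                if output_tx.send((partition, key)).is_err() {
                    return;
                }
            }
        }
        // `output_tx` is dropped here, which closes the output channel.
    });

    for batch in batches {
        input_tx.send(batch).expect("worker exited early");
    }
    // Dropping the sender is the shutdown signal for the worker.
    drop(input_tx);

    let mut partitions = vec![Vec::new(); target_partitions];
    for (partition, key) in output_rx {
        partitions[partition].push(key);
    }
    worker.join().expect("worker panicked");
    partitions
}
```

   Equal keys always land in the same partition, which is the property a 
downstream hash join or aggregation would rely on. In an async context the same 
idea would typically use an async-aware channel and a dedicated runtime or 
thread pool rather than `std::thread`.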
   
   I have started a very basic repository that uses the table provider from 
https://github.com/clflushopt/datafusion-tpch as an in-memory example, alongside a 
streaming version I wrote. I've also started experimenting with a version that 
uses knowledge of the sort order of the data coming out of the `lineitem` table.
   
   https://github.com/timsaucer/tpch_table_provider
   
   ### Describe alternatives you've considered
   
   We have some documentation already in the online pages, but I would like to 
improve these with some specific examples that go beyond the basics.
   
   ### Additional context
   
   _No response_



