timsaucer opened a new issue, #16821: URL: https://github.com/apache/datafusion/issues/16821
### Is your feature request related to a problem or challenge?

While creating table providers for a custom data source, I ran into cases where I could gain significant performance boosts by taking advantage of known data partitioning and ordering. I wanted to put together some useful examples, built on a well-known source, that users can follow along with.

Specifically, I found it helpful to explain the different layers that a user might either implement themselves or cover with one of the generic implementations:

- Table Provider
- Execution Plan
- Sendable Record Batch Stream

Also helpful would be an explanation of *where* the work should occur. I've seen multiple initial approaches where users do a significant amount of async work in the table provider, or add blocking calls to async work in the execution plan. There is already some documentation on where this should be done, but if you are not familiar with all of the internals it may not be obvious where to add your implementation.

### Describe the solution you'd like

What I would most like from this issue is ideas about what would be good to include. Some of the things I've thought about are:

- Demonstrating how to do some CPU-intensive work on a separate threadpool, including how to manage shutdown and how to pass data around
- Seeing if there are any specific queries that can be more performant if we know a priori how the data coming out of the generator is organized. Alternatively, adding a different data source just to demonstrate how improved sorting / partitioning information can be used.
- Adding an example showing how to compute the partition hashes for some number of `target_partitions`
- Looking for examples where improved lex ordering can be taken advantage of (I haven't thought deeply about this)

I have a very basic repository started where I used the table provider in https://github.com/clflushopt/datafusion-tpch as an in-memory example and a streaming version I wrote.
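
To make the first idea in the list above more concrete, here is a minimal sketch of moving CPU-bound work onto a dedicated thread and handing results back over a bounded channel. It uses only the Rust standard library rather than DataFusion or an async runtime, and `expensive_transform` and `spawn_worker` are hypothetical names invented for this illustration. In a real stream implementation the receiving side would presumably be an async channel polled by the record batch stream.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for an expensive, CPU-bound computation.
fn expensive_transform(x: u64) -> u64 {
    (0..1_000u64).fold(x, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
}

/// Spawns a dedicated worker thread that pushes results into a bounded
/// channel. The bound (`capacity`) provides backpressure: the worker blocks
/// when the consumer falls behind instead of buffering without limit.
fn spawn_worker(inputs: Vec<u64>, capacity: usize) -> (Receiver<u64>, JoinHandle<()>) {
    let (tx, rx) = sync_channel(capacity);
    let handle = thread::spawn(move || {
        for x in inputs {
            // `send` returns an error once the receiver has been dropped.
            // That error is the shutdown signal for this worker.
            if tx.send(expensive_transform(x)).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which closes the channel and ends the
        // consumer's iteration.
    });
    (rx, handle)
}

fn main() {
    let (rx, handle) = spawn_worker((0..8).collect(), 2);
    // Drain the channel until the worker finishes and drops its sender.
    let results: Vec<u64> = rx.iter().collect();
    handle.join().expect("worker thread panicked");
    println!("received {} results", results.len());
}
```

Shutdown works in both directions in this sketch: the consumer stops when the worker drops its sender, and the worker stops when the consumer drops the receiver, so neither side needs a separate cancellation flag.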
I've also started toying around with a version that uses knowledge of the sorting of the data coming out of the lineitem table.

https://github.com/timsaucer/tpch_table_provider

### Describe alternatives you've considered

We have some documentation already in the online pages, but I would like to improve these with some specific examples that go beyond the basics.

### Additional context

_No response_
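
As a rough illustration of the "partition hashes for some number of `target_partitions`" idea mentioned above, the sketch below assigns each key to a partition by hashing it and taking the result modulo the partition count. It is standard-library Rust only: `partition_for` and `partition_keys` are hypothetical helpers, and `DefaultHasher` is a stand-in chosen for the example. A provider that wants to declare hash partitioning to the planner would presumably need to reproduce DataFusion's own row hashing, which this sketch does not attempt.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Maps a key to a partition index in `0..target_partitions`.
/// Assumption: `DefaultHasher` is used purely for illustration and is not
/// the hash function DataFusion applies internally.
fn partition_for<K: Hash>(key: &K, target_partitions: usize) -> usize {
    assert!(target_partitions > 0, "need at least one partition");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % target_partitions as u64) as usize
}

/// Groups keys into `target_partitions` buckets so that equal keys always
/// land in the same bucket.
fn partition_keys<K: Hash + Clone>(keys: &[K], target_partitions: usize) -> Vec<Vec<K>> {
    let mut parts = vec![Vec::new(); target_partitions];
    for key in keys {
        parts[partition_for(key, target_partitions)].push(key.clone());
    }
    parts
}

fn main() {
    // Hypothetical order keys, loosely modelled on a lineitem-style table.
    let order_keys: Vec<i64> = (1..=20).collect();
    let parts = partition_keys(&order_keys, 4);
    for (i, part) in parts.iter().enumerate() {
        println!("partition {i}: {part:?}");
    }
}
```

The property that matters for planning is that every row with the same key ends up in the same partition, which is what lets an engine skip a repartition step before a join or aggregation on that key.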