alamb commented on issue #13563:
URL: https://github.com/apache/datafusion/issues/13563#issuecomment-3196874378

   > There are numerous advanced use cases and possible data source-level 
optimisations for table sampling. 
   
   > Existing query engines and databases already implement sampling, but it is 
not in ANSI standard. There are different flavours, but essentially, they allow 
for specific sampling methods and percentages (or sometimes a number of rows): 
`TABLESAMPLE [SYSTEM | BERNOULLI] (PERCENTAGE | ROWS)`
   
   This is my core concern with adding any sort of sampling directly to 
DataFusion: I think the use cases will vary widely across systems, and thus I 
worry that anything we build into DataFusion will likely be fairly complicated 
and still not be what other systems want.
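   
   To make that variation concrete, even the two methods named in the quoted 
syntax behave quite differently. Below is a minimal sketch in plain Rust (not 
DataFusion code). The names `XorShift64`, `bernoulli_sample`, `system_sample`, 
and the `block_size` parameter are hypothetical and exist only for 
illustration; real systems differ in what counts as a "block" (page, file, row 
group, ...).

```rust
/// Tiny deterministic xorshift64 generator so the sketch needs no external
/// crates. Hypothetical helper; a real engine would use a proper RNG.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from an all-zero state.
        XorShift64(if seed == 0 { 0x9E3779B97F4A7C15 } else { seed })
    }

    /// Next pseudo-random value in [0, 1).
    fn next_f64(&mut self) -> f64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        // Use the top 53 bits so the result is always strictly below 1.0.
        (x >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// BERNOULLI-style sampling: every row is kept independently with
/// probability `fraction`.
fn bernoulli_sample<T: Clone>(rows: &[T], fraction: f64, seed: u64) -> Vec<T> {
    let mut rng = XorShift64::new(seed);
    rows.iter()
        .filter(|_| rng.next_f64() < fraction)
        .cloned()
        .collect()
}

/// SYSTEM-style sampling: whole blocks of `block_size` rows are kept or
/// skipped with probability `fraction`, so unselected blocks are never read.
fn system_sample<T: Clone>(
    rows: &[T],
    block_size: usize,
    fraction: f64,
    seed: u64,
) -> Vec<T> {
    let mut rng = XorShift64::new(seed);
    rows.chunks(block_size.max(1)) // guard against a zero block size
        .filter(|_| rng.next_f64() < fraction)
        .flat_map(|block| block.iter().cloned())
        .collect()
}
```

   Row-level sampling has to look at every row, while block-level sampling can 
skip I/O entirely but returns clustered rows. Which trade-off is right depends 
on the data source, which is part of why a single built-in design is hard to 
get right.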
   
   I think it is actually possible to implement table sampling with the 
existing APIs through a combination of:
   1. A SQL planner extension: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_dialect.rs
   2. User-defined extension nodes (i.e. adding extension logical plan nodes)
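   
   As a rough illustration of the second point, here is a toy model in plain 
Rust of how an extension node lets sampling live outside the core plan type. 
This is NOT DataFusion's API: `Plan`, `ExtensionNode`, `SampleNode`, `execute`, 
and `unit_hash` are all hypothetical names, and the real traits and planner 
hooks look different.

```rust
/// A row is just a vector of integers in this toy model.
type Row = Vec<i64>;

/// Hypothetical extension point: the toy engine only knows this trait,
/// never any concrete node type.
trait ExtensionNode {
    fn name(&self) -> &str;
    fn apply(&self, input: Vec<Row>) -> Vec<Row>;
}

/// The "core" plan type. Note that it never mentions sampling.
enum Plan {
    Scan(Vec<Row>),
    Extension {
        node: Box<dyn ExtensionNode>,
        input: Box<Plan>,
    },
}

/// The "core" executor: it delegates to whatever extension node it is given.
fn execute(plan: Plan) -> Vec<Row> {
    match plan {
        Plan::Scan(rows) => rows,
        Plan::Extension { node, input } => node.apply(execute(*input)),
    }
}

/// splitmix64-style mixer mapped to [0, 1); stands in for a real RNG.
fn unit_hash(mut x: u64) -> f64 {
    x = x.wrapping_add(0x9E3779B97F4A7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D049BB133111EB);
    x ^= x >> 31;
    (x >> 11) as f64 / (1u64 << 53) as f64
}

/// Defined "outside" the core: keeps a row when the hash of its position
/// (mixed with the seed) falls below `fraction`.
struct SampleNode {
    fraction: f64,
    seed: u64,
}

impl ExtensionNode for SampleNode {
    fn name(&self) -> &str {
        "Sample"
    }

    fn apply(&self, input: Vec<Row>) -> Vec<Row> {
        input
            .into_iter()
            .enumerate()
            .filter(|(i, _)| unit_hash(*i as u64 ^ self.seed) < self.fraction)
            .map(|(_, row)| row)
            .collect()
    }
}
```

   The point of the toy is only the shape: a system that wants different 
sampling semantics swaps in its own node, and `Plan` and `execute` stay 
untouched.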
   
   I would be willing to help make an example for this use case, to show it is 
possible. I think it would be a nice showcase for how to extend systems using 
DataFusion without having to change the code. 
   
   If there is broad support for adding table sampling to DataFusion, I think 
we should try and make it conform to the [design goals of 
DataFusion](https://docs.rs/datafusion/latest/datafusion/index.html#design-goals):
   
   1. Work “out of the box”: Provide a very fast, world class query engine with 
minimal setup or required configuration.
   2. Customizable everything: All behavior should be customizable by 
implementing traits.
   3. Architecturally boring 🥱: Follow industrial best practice rather than 
trying cutting edge, but unproven, techniques.
   
   So maybe in this case we can begin by figuring out "how would we add table 
sampling?". See:
   
https://datafusion.apache.org/contributor-guide/architecture.html#creating-new-extension-apis
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

