alamb commented on issue #13563: URL: https://github.com/apache/datafusion/issues/13563#issuecomment-3196874378
> There are numerous advanced use cases and possible data source-level optimisations for table sampling.

> Existing query engines and databases already implement sampling, but it is not in ANSI standard. There are different flavours, but essentially, they allow for specific sampling methods and percentages (or sometimes a number of rows) `TABLESAMPLE [SYSTEM | BERNOULLI] (PERCENTAGE | ROWS)`

This is my core concern with adding any sort of sampling directly to DataFusion: I think the use cases will vary widely across systems, so I worry that anything we build into DataFusion will be fairly complicated and still not be what other systems want.

I think it is actually possible to implement table sampling with the existing APIs through a combination of

1. SQL planner extension: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_dialect.rs
2. User defined extension nodes (aka adding extension logical planning nodes)

I would be willing to help make an example for this use case, to show it is possible. I think it would be a nice showcase for how to extend systems using DataFusion without having to change the code.

If there is broad support for adding table sampling to DataFusion, I think we should try to make it conform to the [design goals of DataFusion](https://docs.rs/datafusion/latest/datafusion/index.html#design-goals):

1. Work “out of the box”: Provide a very fast, world class query engine with minimal setup or required configuration.
2. Customizable everything: All behavior should be customizable by implementing traits.
3. Architecturally boring 🥱: Follow industrial best practice rather than trying cutting edge, but unproven, techniques.

So maybe in this case we can begin by figuring out "how would we add table sampling" (see https://datafusion.apache.org/contributor-guide/architecture.html#creating-new-extension-apis).
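To make the `SYSTEM` versus `BERNOULLI` distinction in the quoted proposal concrete, here is a minimal, std-only sketch of the two flavours as they are commonly understood: `BERNOULLI` keeps each row independently, while `SYSTEM` keeps or drops whole blocks of rows. This is not DataFusion code. `SampleMethod`, `XorShift64`, `sample_rows`, and the `block_size` parameter are hypothetical names made up for illustration, and real engines differ on details such as what a "block" is, which is part of why the use cases vary.

```rust
/// Hypothetical sampling method, mirroring the SYSTEM | BERNOULLI choice.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SampleMethod {
    /// Keep or drop whole blocks of rows (cheap, but clustered).
    System,
    /// Keep or drop each row independently.
    Bernoulli,
}

/// Tiny deterministic xorshift64* generator so the sketch needs no external crates.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The generator state must be non-zero.
        Self(seed.max(1))
    }

    /// Returns a value in the half-open range [0, 1).
    pub fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let v = self.0.wrapping_mul(0x2545_F491_4F6C_DD1D);
        // Use the top 53 bits to build the fraction.
        (v >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Returns the indices of the rows that survive sampling.
///
/// `percent` is clamped to 0..=100. `block_size` is only used by
/// `SampleMethod::System` and is an assumption of this sketch.
pub fn sample_rows(
    num_rows: usize,
    block_size: usize,
    percent: f64,
    method: SampleMethod,
    seed: u64,
) -> Vec<usize> {
    let p = (percent / 100.0).clamp(0.0, 1.0);
    let mut rng = XorShift64::new(seed);
    match method {
        // One random draw per row.
        SampleMethod::Bernoulli => (0..num_rows).filter(|_| rng.next_f64() < p).collect(),
        // One random draw per block; a kept block contributes all of its rows.
        SampleMethod::System => {
            let block_size = block_size.max(1);
            let mut kept = Vec::new();
            let mut start = 0;
            while start < num_rows {
                let end = (start + block_size).min(num_rows);
                if rng.next_f64() < p {
                    kept.extend(start..end);
                }
                start = end;
            }
            kept
        }
    }
}
```

For example, `sample_rows(1000, 8, 30.0, SampleMethod::System, 7)` returns rows in runs of eight, whereas the same call with `SampleMethod::Bernoulli` returns individually chosen rows. In an extension-based design, a choice like this would presumably live inside the user defined node rather than in DataFusion itself.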