adriangb commented on PR #22024: URL: https://github.com/apache/datafusion/pull/22024#issuecomment-4443340295
@alamb I think the biggest thing is that it is *not* possible to implement this sort of sampling externally by passing in a `ParquetAccessPlan`: you don't know what the row groups and pages look like until you open the file.

I'm not sure what other sampling strategies might look like. To me it only really makes sense to sample at the row group / page level. Do you have thoughts on what other sampling strategies for Parquet would look like? I linked to multiple systems which sample at the "block" level; for Parquet that is row groups / pages. The pages (row fraction) part is perhaps a bit more questionable. I'm happy to remove it from this PR and add it as a followup if you'd like.

I'm open to prototyping some sort of `ParquetAccessPlanOptimizer`, but I'm not sure it will end up being a simple abstraction; I suspect it will be quite leaky. That is: every time you want to add a new optimizer you have to change the API to add more inputs / more context, or more outputs / things it can change. The adaptive dynamic filter work, for example, has to touch _a lot more_ than just the `ParquetAccessPlan`.

IMO doing this as structured in this PR and factoring out as much code as possible into its own modules probably gets us 90% of the wins without forcing us into APIs we then have to constantly churn.
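
For illustration, here is a minimal sketch of block-level sampling over row groups. It is not the code in this PR and not DataFusion's API: `Access`, `sample_row_groups`, and the `seed` parameter are hypothetical names, and the hash-based selection is just one assumed way to get a repeatable choice. The point it shows is that the row group count is an input, and that count is only available once the file's footer has been read.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical per-row-group decision (a stand-in, not DataFusion's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Access {
    Scan,
    Skip,
}

/// Decide which row groups to scan for a given sampling fraction.
///
/// `num_row_groups` comes from the Parquet footer, so this can only run
/// after the file has been opened.
fn sample_row_groups(num_row_groups: usize, fraction: f64, seed: u64) -> Vec<Access> {
    let fraction = fraction.clamp(0.0, 1.0);
    (0..num_row_groups)
        .map(|idx| {
            // Hash (seed, index) so the choice is repeatable for a given seed.
            // Note: DefaultHasher's algorithm is not guaranteed to stay the
            // same across Rust releases.
            let mut hasher = DefaultHasher::new();
            (seed, idx).hash(&mut hasher);
            // Keep the top 53 bits and scale them into [0, 1).
            let unit = (hasher.finish() >> 11) as f64 / (1u64 << 53) as f64;
            if unit < fraction {
                Access::Scan
            } else {
                Access::Skip
            }
        })
        .collect()
}

fn main() {
    // Pretend the footer reported 8 row groups; sample roughly a quarter.
    let plan = sample_row_groups(8, 0.25, 42);
    println!("{plan:?}");
}
```

A caller that has not opened the file has no value to pass for `num_row_groups`, which is why a precomputed plan handed in from outside cannot express this.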
