adriangb commented on PR #22024:
URL: https://github.com/apache/datafusion/pull/22024#issuecomment-4443340295

   @alamb I think the biggest thing is that it is *not* possible to implement 
this sort of sampling externally by passing in a `ParquetAccessPlan`: you don't 
know what the row groups and pages look like until you open the file.
   
   I'm not sure what other sampling strategies might look like. To me it only 
really makes sense to sample at the row group / page level. Do you have 
thoughts on what other sampling strategies for Parquet would look like? I 
linked to multiple systems which sample at the "block" level. For Parquet that 
is row groups / pages. The pages (row fraction) part is perhaps a bit more 
questionable; I'm happy to remove it and add it as a follow-up if you'd like.
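
   To make the "block level" idea concrete, here is a rough sketch of row-group 
sampling in plain Rust. It is not the code in this PR: `FileLayout`, 
`sample_row_groups` and `mix` are hypothetical names made up for illustration, 
and the hash-based selection is just one assumed way to pick blocks 
deterministically. The point is only that the function needs the file's row 
group layout as an input, which you don't have until the file is opened.

```rust
/// Hypothetical stand-in for the per-file layout that is only known after
/// the Parquet footer has been read (not a DataFusion type).
struct FileLayout {
    /// Number of rows in each row group, in file order.
    row_group_rows: Vec<u64>,
}

/// SplitMix64-style mixing function, used here as a small deterministic hash
/// so the sketch needs no external crates.
fn mix(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E3779B97F4A7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D049BB133111EB);
    x ^ (x >> 31)
}

/// Returns one flag per row group: `true` means scan it, `false` means skip it.
/// Each row group is kept independently with probability `fraction`
/// (assumed semantics, similar in spirit to block-level table sampling).
fn sample_row_groups(layout: &FileLayout, fraction: f64, seed: u64) -> Vec<bool> {
    let fraction = fraction.clamp(0.0, 1.0);
    (0..layout.row_group_rows.len() as u64)
        .map(|i| {
            // Map the hash of (seed, row group index) to a value in [0, 1)
            // using its top 53 bits.
            let u = (mix(seed ^ mix(i)) >> 11) as f64 / (1u64 << 53) as f64;
            u < fraction
        })
        .collect()
}
```

   The returned flags are the kind of thing that could then be turned into 
scan / skip decisions per row group, but only once the layout is in hand.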
   
   I'm open to prototyping some sort of `ParquetAccessPlanOptimizer`, but I'm 
not sure it will end up being a simple abstraction; I suspect it will be quite 
leaky. That is: every time you want to add a new optimizer you have to change 
the API to add more inputs / more context or more outputs / things it can 
change. The adaptive dynamic filter work for example has to touch _a lot more_ 
than just the `ParquetAccessPlan`. I'd guess we'd end up with a very leaky 
abstraction. IMO doing this as structured in this PR and factoring out as much 
code into its own modules and such probably gets us 90% of the wins without 
forcing us into APIs we then have to constantly churn.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

