adriangb opened a new issue, #17954: URL: https://github.com/apache/datafusion/issues/17954
Consider the scenario of:

```sql
SELECT *
FROM large_table
JOIN small_table ON large_table.id = small_table.id
WHERE small_table.name = 'Adrian';
```

As per [our recent blog post](https://datafusion.apache.org/blog/2025/09/10/dynamic-filters/) we will first scan `small_table`, find the `id` for `'Adrian'` and then scan `large_table` with that information available.

But what if we had an external table-level point lookup index for `large_table.id`? We won't be able to use that during the scan.

One option is to add hooks to the Parquet readers that get called before each scan, something like:

```rust
trait ScanPlanUpdater {
    // Called once per file, before any other work is done on that file.
    async fn rescan(&self, file: PartitionedFile, plan: FileScanPlan) -> Result<FileScanPlan>;
}
```

Then we call this before we do any more work on this file to allow checking the point lookup index.

The main issue with this option is that it could result in *a lot more* lookups into the point lookup index than if it were done once at the table level. Maybe implementations of `ScanPlanUpdater` can have some sort of cache?

I don't see a way to do it at the table level: the concept of a table is long gone by this point, and I can't think of a low-friction way to apply a filter to an entire `DataSourceExec`.
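To make the caching idea a bit more concrete, here is a rough, std-only sketch of a memoizing wrapper around a point lookup index. Everything in it is hypothetical and for illustration only: `PointLookupIndex`, `CachedIndex`, `InMemoryIndex` and `should_scan` are made-up names, files are identified by plain strings, and it is synchronous. It is not DataFusion API, and it says nothing about how a real `ScanPlanUpdater` would be wired in.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an external table-level point lookup index:
/// given a key, it returns the names of the files that contain it.
trait PointLookupIndex {
    fn files_containing(&self, id: i64) -> Vec<String>;
}

/// Memoizing wrapper, so a hook invoked once per file only reaches the
/// underlying index once per distinct key.
struct CachedIndex<I> {
    inner: I,
    cache: Mutex<HashMap<i64, Arc<HashSet<String>>>>,
}

impl<I: PointLookupIndex> CachedIndex<I> {
    fn new(inner: I) -> Self {
        Self { inner, cache: Mutex::new(HashMap::new()) }
    }

    /// Files containing `id`, computed on first use and cached afterwards.
    /// The lock is held during the lookup, which keeps the sketch simple
    /// but serializes concurrent cache misses.
    fn files_for(&self, id: i64) -> Arc<HashSet<String>> {
        let mut cache = self.cache.lock().unwrap();
        cache
            .entry(id)
            .or_insert_with(|| Arc::new(self.inner.files_containing(id).into_iter().collect()))
            .clone()
    }

    /// Per-file check: could `file` contain any of `ids`?
    fn should_scan(&self, file: &str, ids: &[i64]) -> bool {
        ids.iter().any(|id| self.files_for(*id).contains(file))
    }
}

/// Toy index used for illustration; it counts how often it is consulted.
struct InMemoryIndex {
    entries: HashMap<i64, Vec<String>>,
    lookups: AtomicUsize,
}

impl InMemoryIndex {
    fn new(pairs: &[(i64, &str)]) -> Self {
        let mut entries: HashMap<i64, Vec<String>> = HashMap::new();
        for (id, file) in pairs {
            entries.entry(*id).or_default().push(file.to_string());
        }
        Self { entries, lookups: AtomicUsize::new(0) }
    }

    fn lookup_count(&self) -> usize {
        self.lookups.load(Ordering::SeqCst)
    }
}

impl PointLookupIndex for InMemoryIndex {
    fn files_containing(&self, id: i64) -> Vec<String> {
        self.lookups.fetch_add(1, Ordering::SeqCst);
        self.entries.get(&id).cloned().unwrap_or_default()
    }
}
```

With this shape, a per-file hook could call `should_scan` for every file while the underlying index is only consulted once per distinct `id`. Cache invalidation and scoping the cache to a single query are left out and would need thought in a real implementation.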
