alamb commented on issue #10546: URL: https://github.com/apache/datafusion/issues/10546#issuecomment-2153424675
> Sorry for jumping in here, maybe this isn't the best issue but it's hard to keep up with all of the amazing work you're doing @alamb!

Thanks @adriangb ❤️

> I wanted to pitch a use case I've been thinking about: storing a secondary index in a searchable async location. Think a relational database with ACID guarantees. In particular, the key would be that the hooks for selection / pruning are async and that filters are passed in: I'd push the filters down into the metadata store and run an actual query there that returns the files / row groups to scan. This is in contrast to #10549, for example, where the index is in memory and fully materialized.

Yes, I agree this is a very common use case in modern database / data systems, and one I hope will be easier to implement with some of these APIs (btw, see https://github.com/apache/datafusion/pull/10813 for an even lower-level API, which I think takes this idea to its lowest level).

> I realize that `TableProvider.scan` already serves this purpose, but it'd be nice to integrate into these new APIs instead of having to implement more things oneself because you're hooking in at a higher (lower?) level.

I agree that you could do an `async` call as part of `TableProvider::scan` to fetch the relevant information from the remote store. Specifically, here:

https://github.com/apache/datafusion/blob/586241f06c3890dbad9a98abf6daee8e6ba43403/datafusion-examples/examples/parquet_index.rs#L223-L263

One thing that is still unclear in my mind is what other APIs we could offer to make it easier to implement an external index. Most of the code in parquet_index.rs is there to create the in-memory index. Maybe we could create an example that shows how to implement a remote index 🤔
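To make the shape of such a remote index a bit more concrete, here is a rough, std-only sketch of the idea: an `async` lookup that takes a pushed-down filter and returns the files / row groups to scan. It deliberately does not use any DataFusion types. `MetadataStore`, `RangeFilter`, `FileSelection`, `files_to_scan`, `example_lookup` and the tiny `block_on` helper are all hypothetical names made up for illustration; in a real provider the call would presumably be awaited inside `TableProvider::scan`, and the store would be an actual database query rather than a `Vec`.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// One file plus the row groups within it that still need scanning (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct FileSelection {
    pub path: String,
    pub row_groups: Vec<usize>,
}

/// Simplified stand-in for a pushed-down filter: `value BETWEEN min AND max`.
pub struct RangeFilter {
    pub min: i64,
    pub max: i64,
}

/// Stand-in for the remote metadata store (assumption: a real one would run a
/// query against a database). Each row is (file, row group, min value, max value).
pub struct MetadataStore {
    pub rows: Vec<(String, usize, i64, i64)>,
}

impl MetadataStore {
    /// Async lookup: returns only the files / row groups whose [min, max]
    /// statistics overlap the filter range.
    pub async fn files_to_scan(&self, filter: &RangeFilter) -> Vec<FileSelection> {
        let mut out: Vec<FileSelection> = Vec::new();
        for (path, row_group, min, max) in &self.rows {
            // Prune row groups whose value range cannot match the filter.
            if *max < filter.min || *min > filter.max {
                continue;
            }
            if let Some(i) = out.iter().position(|s| s.path == *path) {
                out[i].row_groups.push(*row_group);
            } else {
                out.push(FileSelection {
                    path: path.clone(),
                    row_groups: vec![*row_group],
                });
            }
        }
        out
    }
}

/// Waker that does nothing; enough for futures that never actually wait.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor so the sketch runs without an async runtime.
/// A real application would use its runtime (e.g. tokio) instead.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

/// Drives one lookup against a small in-memory stand-in for the store.
pub fn example_lookup() -> Vec<FileSelection> {
    let store = MetadataStore {
        rows: vec![
            ("a.parquet".to_string(), 0, 0, 99),
            ("a.parquet".to_string(), 1, 100, 199),
            ("b.parquet".to_string(), 0, 200, 299),
        ],
    };
    let filter = RangeFilter { min: 150, max: 250 };
    let selections = block_on(store.files_to_scan(&filter));
    selections
}
```

With the sample rows above, a filter of 150 to 250 prunes row group 0 of `a.parquet` and keeps row group 1 of `a.parquet` and row group 0 of `b.parquet`. The point of the sketch is only that the pruning decision happens in the store, driven by the filter, rather than against a fully materialized in-memory index.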
