alamb commented on issue #10546:
URL: https://github.com/apache/datafusion/issues/10546#issuecomment-2153424675

   > Sorry for jumping in here, maybe this isn't the best issue but it's hard 
to keep up with all of the amazing work you're doing @alamb!
   
   Thanks @adriangb ❤️   
   
   
   > I wanted to pitch a use case I've been thinking about of storing a 
secondary index on a searchable async location. Think a relational database 
with ACID guarantees. In particular the key would be that hooks to do 
selections / pruning be async and that they pass in filters: I'd push down the 
filters into filters in the metadata store and run an actual query there that 
returns the files / row groups to scan. This is in contrast to #10549 for 
example where the index is in memory and fully materialized. 
   
   Yes, I agree this is a very common use case in modern database / data systems 
and one I hope will be easier to implement with some of these APIs (btw see 
https://github.com/apache/datafusion/pull/10813 for an even lower level API 
which I think brings this idea to its lowest level).
   
   > I realize that `TableProvider.scan` already serves this purpose, but it'd 
be nice to integrate into these new APIs instead of having to implement more 
things oneself because you're hooking in at a higher (lower?) level.
   
   I agree that you could do an `async` call as part of `TableProvider::scan` 
to fetch the relevant information from the remote store. Specifically, here 
https://github.com/apache/datafusion/blob/586241f06c3890dbad9a98abf6daee8e6ba43403/datafusion-examples/examples/parquet_index.rs#L223-L263
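
As a rough illustration of that idea (not the code in `parquet_index.rs`), the std-only sketch below models an async lookup against a remote index that happens before deciding which files to scan. Every name in it (`RemoteIndex`, `FileEntry`, `files_overlapping`, `plan_scan`, `block_on`, `sample_index`) is hypothetical, the "remote" store is just an in-memory `Vec`, and a real implementation would put the awaited call inside `TableProvider::scan` and return an execution plan rather than a list of paths.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical per-file entry kept in the remote metadata store.
struct FileEntry {
    path: &'static str,
    min_value: i64,
    max_value: i64,
}

/// Hypothetical stand-in for a remote index (e.g. a relational database).
struct RemoteIndex {
    entries: Vec<FileEntry>,
}

impl RemoteIndex {
    /// Pretends to run a query such as
    /// `SELECT path FROM files WHERE max_value >= lo AND min_value <= hi`.
    /// A real implementation would await a network round trip here.
    async fn files_overlapping(&self, lo: i64, hi: i64) -> Vec<String> {
        self.entries
            .iter()
            .filter(|e| e.max_value >= lo && e.min_value <= hi)
            .map(|e| e.path.to_string())
            .collect()
    }
}

/// Mirrors the shape of an async `scan`: push the filter down to the index,
/// await the answer, and only then decide which files to read.
async fn plan_scan(index: &RemoteIndex, lo: i64, hi: i64) -> Vec<String> {
    index.files_overlapping(lo, hi).await
}

/// Waker that does nothing; enough for futures that never return `Pending`.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal std-only executor so the sketch runs without an async runtime.
/// It simply polls in a loop, so it is only suitable for toy futures.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Three made-up files with non-overlapping value ranges.
fn sample_index() -> RemoteIndex {
    RemoteIndex {
        entries: vec![
            FileEntry { path: "a.parquet", min_value: 0, max_value: 99 },
            FileEntry { path: "b.parquet", min_value: 100, max_value: 199 },
            FileEntry { path: "c.parquet", min_value: 200, max_value: 299 },
        ],
    }
}
```

Calling `block_on(plan_scan(&sample_index(), 150, 250))` asks the index for files whose value range overlaps `[150, 250]`. In an actual DataFusion application the async runtime would drive the future, so the toy `block_on` would not be needed.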
   
   One thing that is still unclear in my mind is what other APIs we could offer 
to make it easier to implement an external index. Most of the code in 
parquet_index.rs is to create the in-memory index. Maybe we could create an 
example that shows how to implement a remote index 🤔 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

