alamb commented on issue #1836:
URL: https://github.com/apache/arrow-datafusion/issues/1836#issuecomment-1042856074


   It seems to me that if your goal is basically to make some object store calls, get the list of files from S3, and then build a catalog from that snapshot, the memory provider / builder is probably the simplest route to take.
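   The snapshot approach can be sketched roughly as follows. This is a hypothetical, dependency-free illustration (not the actual DataFusion `MemoryCatalogProvider` API): a slice of object keys stands in for the result of a single up-front S3 `list` call, and the keys are grouped once into a schema → tables map.

   ```rust
   // Hypothetical sketch: build a one-time snapshot of an S3 listing into an
   // in-memory schema -> tables map, the shape a memory catalog would hold.
   use std::collections::BTreeMap;

   /// Group object keys of the form `catalog/schema/table` into
   /// schema -> [tables], for a single catalog prefix.
   fn snapshot_catalog(keys: &[&str], catalog: &str) -> BTreeMap<String, Vec<String>> {
       let mut schemas: BTreeMap<String, Vec<String>> = BTreeMap::new();
       for key in keys {
           let mut parts = key.splitn(3, '/');
           if parts.next() != Some(catalog) {
               continue; // key belongs to a different catalog prefix
           }
           if let (Some(schema), Some(table)) = (parts.next(), parts.next()) {
               schemas
                   .entry(schema.to_string())
                   .or_default()
                   .push(table.to_string());
           }
       }
       schemas
   }

   fn main() {
       // Stand-in for the result of one up-front `list` call against S3.
       let keys = [
           "active/schema1/tableA",
           "active/schema1/tableB",
           "active/schema2/tableD",
           "hist/schema1/tableE",
       ];
       let snapshot = snapshot_catalog(&keys, "active");
       assert_eq!(snapshot["schema1"], vec!["tableA", "tableB"]);
       assert_eq!(snapshot["schema2"], vec!["tableD"]);
       println!("{snapshot:?}");
   }
   ```

   Because the listing happens exactly once, queries against the resulting catalog never touch the object store for metadata, which is where the "fast to query, blind to new files" tradeoff comes from.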
   
   However, if you want to do something more sophisticated (for example, not traversing the S3 directory / prefix structure up front, but doing it on demand), a specialized implementation of `S3Catalog` might be more helpful.
   
   As an example of "on demand", suppose you had an object store laid out like this:
   ```
   s3://active/schema1/tableA
   s3://active/schema1/tableB
   s3://active/schema1/tableC
   s3://active/schema2/tableD
   s3://hist/...
   ...
   ```
   
   You could make an `S3Catalog` for each of the first prefixes (`active` and 
`hist`).
   
    If you wrote a query like `SELECT * FROM active.schema1.tableA` then:
   1. the `S3Catalog` for `active` would be asked what schemas it has (it could ask the object store and return `S3Schema`s for `schema1` and `schema2`)
   2. the `S3Schema` for `schema1` could then be asked what tables it knows about, and it would ask the object store and return table providers for `tableA`, `tableB`, and `tableC`
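   The two-step resolution above can be sketched like this. Again a hypothetical, dependency-free illustration (not the real DataFusion `CatalogProvider` / `SchemaProvider` traits): a `Vec` of keys stands in for the remote store, and each lookup performs a fresh prefixed listing, so files added after setup become visible.

   ```rust
   // Hypothetical sketch of an "on demand" S3Catalog: every lookup issues a
   // fresh prefixed list against the object store (simulated here by a Vec
   // of keys), so new files are seen but each query pays for the listing.
   use std::collections::BTreeSet;

   struct S3Catalog {
       catalog: String,    // e.g. "active"
       store: Vec<String>, // stand-in for the remote object store
   }

   impl S3Catalog {
       /// List the distinct second path segments under `catalog/` (the schemas).
       fn schema_names(&self) -> Vec<String> {
           let prefix = format!("{}/", self.catalog);
           let set: BTreeSet<String> = self
               .store
               .iter()
               .filter_map(|k| k.strip_prefix(&prefix))
               .filter_map(|rest| rest.split('/').next())
               .map(str::to_string)
               .collect();
           set.into_iter().collect()
       }

       /// List the table names under `catalog/schema/`.
       fn table_names(&self, schema: &str) -> Vec<String> {
           let prefix = format!("{}/{}/", self.catalog, schema);
           self.store
               .iter()
               .filter_map(|k| k.strip_prefix(&prefix))
               .map(str::to_string)
               .collect()
       }
   }

   fn main() {
       let mut catalog = S3Catalog {
           catalog: "active".to_string(),
           store: vec![
               "active/schema1/tableA".to_string(),
               "active/schema1/tableB".to_string(),
               "active/schema2/tableD".to_string(),
           ],
       };
       assert_eq!(catalog.schema_names(), vec!["schema1", "schema2"]);
       assert_eq!(catalog.table_names("schema1"), vec!["tableA", "tableB"]);

       // A file added after setup is visible, because listing happens per query.
       catalog.store.push("active/schema1/tableC".to_string());
       assert_eq!(
           catalog.table_names("schema1"),
           vec!["tableA", "tableB", "tableC"]
       );
       println!("ok");
   }
   ```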
   
   As you can imagine, there are tradeoffs between the two approaches: the first takes longer to set up but is much faster to query each time (though it won't see any new files that appear in S3). The second is very fast to set up and will see new files as they appear, but it makes object store requests during planning and so will be slower.

