tustvold opened a new issue, #2206:
URL: https://github.com/apache/arrow-datafusion/issues/2206

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   This has been discussed in several places, including 
https://github.com/apache/arrow-datafusion/issues/907 and 
https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53, so I 
am creating an issue for visibility.
   
   **Describe the solution you'd like**
   
   I would propose creating a new datafusion-contrib crate, perhaps 
`datafusion-catalog-glue`, which communicates with an [AWS Glue Data 
Catalog](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html).
   
   I'll leave the exact design for whoever picks this up, but I might expect 
something along the following lines.
   
   * Create a `GlueCatalog` with an optional catalog ID
   * Provide an `async fn GlueCatalog::list_databases(&self) -> Vec<String>` to list the databases
   * Provide an `async fn GlueCatalog::get_database(&self, name: &str) -> Result<GlueDatabase>` to get a database
   * Implement `SchemaProvider` for `GlueDatabase`
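
   The shape of the API in the list above can be sketched as follows. This is only a rough, dependency-free illustration, not a proposed implementation: the in-memory `HashMap` stands in for the remote Glue service, and `GlueError`, `GlueResult`, `GlueTable`, `with_database`, and the tiny `block_on` helper are hypothetical names introduced for the example. A real crate would call the Glue API through an AWS SDK, run on a proper async runtime, and implement DataFusion's `SchemaProvider` trait rather than the two look-alike methods shown here.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical error type; a real crate would likely wrap the AWS SDK error.
#[derive(Debug, Clone, PartialEq)]
pub enum GlueError {
    DatabaseNotFound(String),
}

/// Hypothetical alias standing in for the `Result` named in the proposal.
pub type GlueResult<T> = std::result::Result<T, GlueError>;

/// Hypothetical table metadata: (column name, type) pairs plus an S3 location.
#[derive(Debug, Clone, PartialEq)]
pub struct GlueTable {
    pub columns: Vec<(String, String)>,
    pub location: String,
}

/// One Glue database. In a real crate this type would implement `SchemaProvider`.
#[derive(Debug, Clone, PartialEq)]
pub struct GlueDatabase {
    name: String,
    tables: HashMap<String, GlueTable>,
}

impl GlueDatabase {
    pub fn new(name: &str, tables: HashMap<String, GlueTable>) -> Self {
        Self { name: name.to_string(), tables }
    }

    /// Table names in sorted order, so the output is deterministic.
    pub fn table_names(&self) -> Vec<String> {
        let mut names: Vec<String> = self.tables.keys().cloned().collect();
        names.sort();
        names
    }

    pub fn table(&self, name: &str) -> Option<&GlueTable> {
        self.tables.get(name)
    }
}

/// Catalog handle with an optional catalog ID, as in the proposal.
pub struct GlueCatalog {
    catalog_id: Option<String>,
    // Assumption: in-memory stand-in for state that really lives in AWS Glue.
    databases: HashMap<String, GlueDatabase>,
}

impl GlueCatalog {
    pub fn new(catalog_id: Option<String>) -> Self {
        Self { catalog_id, databases: HashMap::new() }
    }

    pub fn catalog_id(&self) -> Option<&str> {
        self.catalog_id.as_deref()
    }

    /// Hypothetical helper that seeds the in-memory stand-in.
    pub fn with_database(mut self, db: GlueDatabase) -> Self {
        self.databases.insert(db.name.clone(), db);
        self
    }

    pub async fn list_databases(&self) -> Vec<String> {
        let mut names: Vec<String> = self.databases.keys().cloned().collect();
        names.sort();
        names
    }

    pub async fn get_database(&self, name: &str) -> GlueResult<GlueDatabase> {
        self.databases
            .get(name)
            .cloned()
            .ok_or_else(|| GlueError::DatabaseNotFound(name.to_string()))
    }
}

/// Minimal busy-polling executor, adequate only because the futures above
/// never actually wait on I/O. A real application would use an async runtime.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    struct NoopWake;
    impl Wake for NoopWake {
        fn wake(self: Arc<Self>) {}
    }
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

   Returning an owned `GlueDatabase` from `get_database` keeps the catalog free of borrows across `await` points; whether that is the right trade-off is left to whoever picks this up.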
   
   I think it should be possible to reuse the `FileScanConfig` structure used 
by `ListingTable` to simplify implementation of the `TableProvider`.
   
   **Describe alternatives you've considered**
   
   We could choose not to support AWS Glue.
   
   **Additional context**
   
   This will help with 
https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53 by 
removing the need to infer the schema from the files on every query, and by 
listing files only in non-pruned partitions.
   
   This may need to depend on 
https://github.com/datafusion-contrib/datafusion-objectstore-s3 as I think it 
will still need to list S3 in order to get the files within a given partition.
   
   The Glue API is not the snappiest of things, so a future extension might be 
to cache the metadata returned, as is done by the [Java 
client](https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore#enabling-client-side-caching-for-catalog).
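
   One possible shape for such a cache is sketched below. The time-to-live policy, the `MetadataCache` name, and the explicit `now` parameter (passed in so that expiry is easy to test) are all assumptions made for illustration; nothing here is taken from the Java client or from an existing DataFusion API.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical client-side cache for catalog metadata, keyed by name.
pub struct MetadataCache<V> {
    ttl: Duration,
    // Each entry remembers when it was fetched so it can expire.
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> MetadataCache<V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return the cached value for `key` if it is younger than the TTL,
    /// otherwise call `load` (for example, a Glue API request) and store
    /// the result with `now` as its fetch time.
    pub fn get_or_load(&mut self, key: &str, now: Instant, load: impl FnOnce() -> V) -> V {
        if let Some((fetched_at, value)) = self.entries.get(key) {
            if now.duration_since(*fetched_at) < self.ttl {
                return value.clone();
            }
        }
        let value = load();
        self.entries.insert(key.to_string(), (now, value.clone()));
        value
    }
}
```

   A caller would wrap each catalog lookup in `get_or_load`, so repeated queries within the TTL avoid a round trip to Glue at the cost of possibly serving stale metadata.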


