alamb commented on code in PR #5732:
URL: https://github.com/apache/arrow-datafusion/pull/5732#discussion_r1156366951
##########
docs/source/user-guide/cli.md:
##########
@@ -198,19 +232,56 @@ Details of the environment variables that can be used are
 - AWS_CONTAINER_CREDENTIALS_RELATIVE_URI -> <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html>
 - AWS_ALLOW_HTTP -> set to "true" to permit HTTP connections without TLS
 
-Example:
+## Registering OSS Data Sources
 
-```bash
-$ aws s3 cp test.csv s3://my-bucket/
-upload: ./test.csv to s3://my-bucket/test.csv
+OSS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
 
-$ export AWS_DEFAULT_REGION=us-east-2
-$ export AWS_SECRET_ACCESS_KEY=***************************
-$ export AWS_ACCESS_KEY_ID=**************
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+  'access_key_id' '******',
+  'secret_access_key' '******',
+  'endpoint' 'https://bucket.oss-cn-hangzhou.aliyuncs.com'
+)
+LOCATION 'oss://bucket/path/file.parquet';
+```
+
+The supported OPTIONS are:
+
+- access_key_id
+- secret_access_key
+- endpoint
+
+Note that the `endpoint` format of oss needs to be: `https://{bucket}.{oss-region-endpoint}`
+
+## Registering GCS Data Sources
+
+GCS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.

Review Comment:

```suggestion
[Google Cloud Storage](https://cloud.google.com/storage) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
```
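(The GCS hunk above is cut off at the commented line, so its OPTIONS list is not visible in this thread. Purely as a hedged illustration, a registration might look like the sketch below; the `service_account_path` option name is an assumption mirroring `GoogleCloudStorageBuilder::with_service_account_path` in the removed `object_storage.rs` shown later in this message, not a confirmed option.)

```sql
-- Hypothetical sketch only: 'service_account_path' is inferred from the
-- removed builder code, not from a documented OPTIONS list.
CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
  'service_account_path' '/path/to/service-account.json'
)
LOCATION 'gs://bucket/path/file.parquet';
```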
##########
datafusion-cli/src/object_storage.rs:
##########
@@ -1,193 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-use datafusion::error::Result;
-use std::{env, str::FromStr, sync::Arc};
-
-use datafusion::datasource::object_store::{
-    DefaultObjectStoreRegistry, ObjectStoreRegistry,
-};
-use datafusion::error::DataFusionError;
-use object_store::{aws::AmazonS3Builder, gcp::GoogleCloudStorageBuilder, ObjectStore};
-use url::Url;
-
-#[derive(Debug, PartialEq, Eq, clap::ArgEnum, Clone)]
-pub enum ObjectStoreScheme {
-    S3,
-    GCS,
-}
-
-impl FromStr for ObjectStoreScheme {
-    type Err = DataFusionError;
-
-    fn from_str(input: &str) -> Result<Self> {
-        match input {
-            "s3" => Ok(ObjectStoreScheme::S3),
-            "gs" | "gcs" => Ok(ObjectStoreScheme::GCS),
-            _ => Err(DataFusionError::Execution(format!(
-                "Unsupported object store scheme {}",
-                input
-            ))),
-        }
-    }
-}
-
-/// An [`ObjectStoreRegistry`] that can automatically create S3 and GCS stores for a given URL
-#[derive(Debug, Default)]
-pub struct DatafusionCliObjectStoreRegistry {
-    inner: DefaultObjectStoreRegistry,
-}
-
-impl DatafusionCliObjectStoreRegistry {
-    pub fn new() -> Self {
-        Default::default()
-    }
-}
-
-impl ObjectStoreRegistry for DatafusionCliObjectStoreRegistry {
-    fn register_store(
-        &self,
-        url: &Url,
-        store: Arc<dyn ObjectStore>,
-    ) -> Option<Arc<dyn ObjectStore>> {
-        self.inner.register_store(url, store)
-    }
-
-    fn get_store(&self, url: &Url) -> Result<Arc<dyn ObjectStore>> {
-        self.inner.get_store(url).or_else(|_| {
-            let store =
-                ObjectStoreScheme::from_str(url.scheme()).map(
-                    |scheme| match scheme {
-                        ObjectStoreScheme::S3 => build_s3_object_store(url),
-                        ObjectStoreScheme::GCS => build_gcs_object_store(url),
-                    },
-                )??;
-
-            self.inner.register_store(url, store.clone());
-
-            Ok(store)
-        })
-    }
-}
-
-fn build_s3_object_store(url: &Url) -> Result<Arc<dyn ObjectStore>> {
-    let host = get_host_name(url)?;
-    match AmazonS3Builder::from_env().with_bucket_name(host).build() {
-        Ok(s3) => Ok(Arc::new(s3)),
-        Err(err) => Err(DataFusionError::External(Box::new(err))),
-    }
-}
-
-fn build_gcs_object_store(url: &Url) -> Result<Arc<dyn ObjectStore>> {
-    let host = get_host_name(url)?;
-    let mut builder = GoogleCloudStorageBuilder::new().with_bucket_name(host);
-
-    if let Ok(path) = env::var("GCP_SERVICE_ACCOUNT_PATH") {
-        builder = builder.with_service_account_path(path);
-    }
-    match builder.build() {
-        Ok(gcs) => Ok(Arc::new(gcs)),
-        Err(err) => Err(DataFusionError::External(Box::new(err))),
-    }
-}
-
-fn get_host_name(url: &Url) -> Result<&str> {
-    url.host_str().ok_or_else(|| {
-        DataFusionError::Execution(format!(
-            "Not able to parse hostname from url, {}",
-            url.as_str()
-        ))
-    })
-}
-
-#[cfg(test)]
-mod tests {

Review Comment:

do you think there is any value to porting these tests?
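(The test module body is truncated in the diff above. To give a rough sense of what porting would involve, here is a minimal sketch of the kind of test the removed module could have contained, reconstructed purely from the API visible above; it is an assumption, not one of the original tests.)

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn rejects_unsupported_scheme() {
        let registry = DatafusionCliObjectStoreRegistry::new();
        // No store is registered for the "foo" scheme, and the scheme-based
        // fallback cannot build one either, so get_store must return an error.
        let url = Url::parse("foo://bucket/path").unwrap();
        let err = registry.get_store(&url).unwrap_err();
        assert!(err.to_string().contains("Unsupported object store scheme"));
    }
}
```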
##########
docs/source/user-guide/cli.md:
##########
@@ -180,15 +180,49 @@ STORED AS CSV
 LOCATION '/path/to/aggregate_test_100.csv';
 ```
 
-## Querying S3 Data Sources
+## Registering S3 Data Sources
 
-The CLI can query data in S3 if the following environment variables are defined:
+S3 data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
 
-- `AWS_DEFAULT_REGION`
-- `AWS_ACCESS_KEY_ID`
-- `AWS_SECRET_ACCESS_KEY`
+```sql
+CREATE EXTERNAL TABLE test
+STORED AS PARQUET
+OPTIONS(
+  'access_key_id' '******',
+  'secret_access_key' '******',
+  'region' 'us-east-2'
+)
+LOCATION 's3://bucket/path/file.parquet';
+```
+
+The supported OPTIONS are:
+
+- access_key_id
+- secret_access_key
+- session_token
+- region
+
+It is also simplify sql statements by environment variables.

Review Comment:

```suggestion
It is also possible to simplify sql statements by environment variables.
```

##########
datafusion-cli/src/object_storage.rs:
##########
@@ -1,193 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one

Review Comment:

Very nice that we can consolidate the usage

##########
docs/source/user-guide/cli.md:
##########
@@ -180,15 +180,49 @@ STORED AS CSV
 LOCATION '/path/to/aggregate_test_100.csv';
 ```
 
-## Querying S3 Data Sources
+## Registering S3 Data Sources
 
-The CLI can query data in S3 if the following environment variables are defined:
+S3 data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.

Review Comment:

Perhaps we should define S3 by adding a link:

```suggestion
[AWS S3](https://aws.amazon.com/s3/) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
```

##########
docs/source/user-guide/cli.md:
##########
@@ -198,19 +232,56 @@ Details of the environment variables that can be used are
 - AWS_CONTAINER_CREDENTIALS_RELATIVE_URI -> <https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html>
 - AWS_ALLOW_HTTP -> set to "true" to permit HTTP connections without TLS
 
-Example:
+## Registering OSS Data Sources
 
-```bash
-$ aws s3 cp test.csv s3://my-bucket/
-upload: ./test.csv to s3://my-bucket/test.csv
+OSS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.

Review Comment:

```suggestion
[Alibaba cloud OSS](https://www.alibabacloud.com/product/object-storage-service) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
```
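(Connecting the S3 hunks above: the removed bash example exported AWS_DEFAULT_REGION, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY, and the new docs note that environment variables can simplify the SQL. Assuming the new code reads those variables the way the removed `AmazonS3Builder::from_env()` did, the statement would presumably shrink to something like this sketch.)

```sql
-- Hedged sketch: assumes credentials and region are picked up from the AWS_*
-- environment variables, so no OPTIONS clause is needed.
CREATE EXTERNAL TABLE test
STORED AS PARQUET
LOCATION 's3://bucket/path/file.parquet';
```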
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]