alamb commented on code in PR #11035:
URL: https://github.com/apache/datafusion/pull/11035#discussion_r1720948344


##########
datafusion/catalog/src/dynamic_file/catalog.rs:
##########
@@ -0,0 +1,250 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! dynamic_file contains [`DynamicFileCatalog`] that creates tables from file 
paths
+//! if the wrapped [`CatalogProviderList`] doesn't have the table provider.
+
+use crate::{CatalogProvider, CatalogProviderList, SchemaProvider, 
TableProvider};
+use async_trait::async_trait;
+#[cfg(feature = "home_dir")]
+use dirs::home_dir;
+use std::any::Any;
+use std::sync::Arc;
+
+/// Wrap another catalog provider list
+pub struct DynamicFileCatalog {

Review Comment:
   I wonder if it would make more sense to call this DynamicFileCatalog with 
the idea that eventually it would support arbitrary urls, not just local files?



##########
datafusion/sqllogictest/src/test_context.rs:
##########
@@ -91,13 +91,18 @@ impl TestContext {
                 {
                     info!("Registering avro tables");
                     register_avro_tables(&mut test_ctx).await;
+                    test_ctx.ctx = test_ctx.ctx.enable_url_table();
                 }
                 #[cfg(not(feature = "avro"))]
                 {
                     info!("Skipping {file_name} because avro feature is not 
enabled");
                     return None;
                 }
             }
+            "describe.slt" | "arrow_files.slt" | "csv_files.slt" | "json.slt"

Review Comment:
   I think it would help to point out some rationale
   
   ```suggestion
              // For (only) these files, enable the dynamic file catalog
               "describe.slt" | "arrow_files.slt" | "csv_files.slt" | "json.slt"
   ```



##########
README.md:
##########
@@ -91,6 +91,7 @@ Optional features:
 - `backtrace`: include backtrace information in error messages
 - `pyarrow`: conversions between PyArrow and DataFusion types
 - `serde`: enable arrow-schema's `serde` feature
+- `home_dir` : enable support for substituting the tilde character in the file 
path with the user home directory for the URL table

Review Comment:
   I think an optional feature makes sense
   
   Another alternative would be to avoid the dependency entirely. by leaving 
tilde expansion in datafusion-cli and an API in the core to call it



##########
datafusion-cli/src/catalog.rs:
##########
@@ -115,13 +114,14 @@ impl CatalogProvider for DynamicFileCatalogProvider {
     }
 }
 
-/// Wraps another schema provider
-struct DynamicFileSchemaProvider {
+/// Wraps another schema provider. [DynamicObjectStoreSchemaProvider] is 
responsible for registering the required

Review Comment:
   I am trying to understand why datafusion-cli can't simply use 
`DynamicFileCatalog` -- why does it need another layer of wrapping?
   
   Is it the need to dynamically create `ObjectStore`s? 
   
   I have found that dynamic creation to be one of the more complicated things 
about datafusion-cli (and what users of DataFusion would have to figure out to 
recreate it). Maybe we can figure out some simpler API for that too (as another 
PR)



##########
datafusion/core/src/datasource/dynamic_file.rs:
##########
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! dynamic_file_schema contains an [`UrlTableFactory`] implementation that
+//! can create a [`ListingTable`] from the given url.
+
+use std::sync::Arc;
+
+use async_trait::async_trait;
+use datafusion_catalog::{SessionStore, UrlTableFactory};
+use datafusion_common::plan_datafusion_err;
+
+use crate::datasource::listing::{ListingTable, ListingTableConfig, 
ListingTableUrl};
+use crate::datasource::TableProvider;
+use crate::error::Result;
+use crate::execution::context::SessionState;
+
+/// [DynamicListTableFactory] is a factory that can create a [ListingTable] 
from the given url.
+#[derive(Default)]
+pub struct DynamicListTableFactory {
+    /// The session store that contains the current session.
+    session_store: Arc<SessionStore>,
+}
+
+impl DynamicListTableFactory {
+    /// Create a new [DynamicListTableFactory] with the given state store.
+    pub fn new(session_store: Arc<SessionStore>) -> Self {
+        Self { session_store }
+    }
+
+    fn session_store(&self) -> Arc<SessionStore> {
+        Arc::clone(&self.session_store)
+    }
+}
+
+#[async_trait]
+impl UrlTableFactory for DynamicListTableFactory {
+    async fn try_new(&self, url: &str) -> Result<Option<Arc<dyn 
TableProvider>>> {
+        let Ok(table_url) = ListingTableUrl::parse(url) else {
+            return Ok(None);
+        };
+
+        let state = &self
+            .session_store()
+            .get_session()
+            .upgrade()
+            .and_then(|session| {
+                session
+                    .read()
+                    .as_any()
+                    .downcast_ref::<SessionState>()

Review Comment:
   Yeah I agree another PR might make more sense. 
   
   Perhaps since SessionStore may not be needed longer term, we could simply 
have the field directly hold the session
   
   ```rust
   session: Arc<Mutex<Option<Weak<RwLock<dyn Session>>>>>;
   ```



##########
datafusion/sqllogictest/test_files/arrow_files.slt:
##########
@@ -43,6 +43,14 @@ SELECT * FROM arrow_simple
 3 baz false
 4 NULL true
 
+query ITB

Review Comment:
   I recommend creating a new `.slt` file explicitly for dynamic file catalog 
tests rather than extending the existing tests. This would:
   1. Help show that the behavior isn't changed except when dynamic catalog is 
enabled
   2. make it easier to evaluate how much dynamic file catalog testing there was
   
   Also, is there sufficient negative testing? Specifically that the dynamic 
catalog provider isn't enabled by default (which would be a security hole?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to