phillipleblanc opened a new issue, #13822:
URL: https://github.com/apache/datafusion/issues/13822

   ### Is your feature request related to a problem or challenge?
   
   Currently DataFusion only support queries with 3-levels of nesting, i.e. 
`SELECT * FROM catalog.schema.table`. Many catalog providers (i.e. Iceberg) 
allow any arbitrary level of nesting, i.e.:
   
   ```bash
   .
   ├── benchmarks
   │   └── tpcds
   │       ├── foo
   │       └── bar
   ├── spice
   │   ├── tpch
   │   │   ├── orders
   │   │   └── customers
   │   ├── info
   │   └── extra
   │       └── tpch_orders_metadata
   └── one
       └── two
           └── three
               └── four
                   └── five
   ```
   
   Attempting to represent this in DataFusion is tricky and several of the 
alternatives I considered (see below) are poor UX.
   
   ### Describe the solution you'd like
   
   I would like to be able to write a catalog provider that allows users to 
select any of the tables in the Iceberg catalog with a natural dot separated 
syntax. `sqlparser-rs` already supports parsing this, creating an `ObjectName` 
which is a `Vec<Ident>`. 
   
   There is a function in DataFusion 
[`idents_to_table_reference`](https://github.com/apache/datafusion/blob/main/datafusion/sql/src/planner.rs#L668)
 that is responsible for transforming the `Vec<Ident>` into a `TableReference`. 
   
   Its current implement looks like:
   
   ```rust
   pub(crate) fn idents_to_table_reference(
       idents: Vec<Ident>,
       enable_normalization: bool,
   ) -> Result<TableReference> {
       let mut taker = IdentTaker::new(idents, enable_normalization);
   
       match taker.len() {
           1 => {
               let table = taker.take();
               Ok(TableReference::bare(table))
           }
           2 => {
               let table = taker.take();
               let schema = taker.take();
               Ok(TableReference::partial(schema, table))
           }
           3 => {
               let table = taker.take();
               let schema = taker.take();
               let catalog = taker.take();
               Ok(TableReference::full(catalog, schema, table))
           }
           _ => plan_err!(
               "Unsupported compound identifier '{}'. Expected 1, 2 or 3 parts, 
got {}",
               taker,
               taker.len()
           ),
       }
   }
   ```
   
   Instead of erroring on >3 idents, I propose that we concatenate all of the 
"middle" namespaces into the `schema` part:
   
   ```rust
           // This won't compile as is, but demonstrates the idea
           _ => {
               // Concatenate all of the middle identifiers with a dot and put 
into the "schema" field
               let table = taker.take(enable_normalization);
               let schema = idents
                   .iter()
                   .skip(1)
                   .take(idents.len() - 2)
                   .map(|ident| {
                       
IdentNormalizer::new(enable_normalization).normalize(ident.clone())
                   })
                   .collect::<Vec<_>>()
                   .join(".");
               let catalog = taker.take(enable_normalization);
               Ok(TableReference::full(catalog, schema, table))
           }
   ```
   
   That would allow us to do `SELECT * FROM one.two.three.four.five` and have 
it be converted into a DF TableReference with:
   `catalog`: one
   `schema`: two.three.four
   `table`: five
   
   And any catalog provider implementations can know to create "schemas" to 
match this format to support arbitrarily nested namespaces.
   
   ### Describe alternatives you've considered
   
   I've considered in my implementation when integrating with a catalog 
provider to concatenate the middle namespaces with a `_` in the schema part so 
that `SELECT * FROM one.two.three.four.five` would become `SELECT * FROM 
one.two_three_four.five`.
   
   I've also considered concatenating the middle namspaces with `.` and not 
changing DataFusion, but that would require doing `SELECT * FROM 
one."two.three.four".five` which is also not an ideal UX.
   
   Another alternative if we didn't want this to be the default, is to allow 
users to customize the behavior of the `object_name_to_table_reference` 
function somehow.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to