phillipleblanc opened a new issue, #13822:
URL: https://github.com/apache/datafusion/issues/13822
### Is your feature request related to a problem or challenge?
Currently DataFusion only support queries with 3-levels of nesting, i.e.
`SELECT * FROM catalog.schema.table`. Many catalog providers (i.e. Iceberg)
allow any arbitrary level of nesting, i.e.:
```bash
.
├── benchmarks
│ └── tpcds
│ ├── foo
│ └── bar
├── spice
│ ├── tpch
│ │ ├── orders
│ │ └── customers
│ ├── info
│ └── extra
│ └── tpch_orders_metadata
└── one
└── two
└── three
└── four
└── five
```
Attempting to represent this in DataFusion is tricky and several of the
alternatives I considered (see below) are poor UX.
### Describe the solution you'd like
I would like to be able to write a catalog provider that allows users to
select any of the tables in the Iceberg catalog with a natural dot separated
syntax. `sqlparser-rs` already supports parsing this, creating an `ObjectName`
which is a `Vec<Ident>`.
There is a function in DataFusion
[`idents_to_table_reference`](https://github.com/apache/datafusion/blob/main/datafusion/sql/src/planner.rs#L668)
that is responsible for transforming the `Vec<Ident>` into a `TableReference`.
Its current implement looks like:
```rust
pub(crate) fn idents_to_table_reference(
idents: Vec<Ident>,
enable_normalization: bool,
) -> Result<TableReference> {
let mut taker = IdentTaker::new(idents, enable_normalization);
match taker.len() {
1 => {
let table = taker.take();
Ok(TableReference::bare(table))
}
2 => {
let table = taker.take();
let schema = taker.take();
Ok(TableReference::partial(schema, table))
}
3 => {
let table = taker.take();
let schema = taker.take();
let catalog = taker.take();
Ok(TableReference::full(catalog, schema, table))
}
_ => plan_err!(
"Unsupported compound identifier '{}'. Expected 1, 2 or 3 parts,
got {}",
taker,
taker.len()
),
}
}
```
Instead of erroring on >3 idents, I propose that we concatenate all of the
"middle" namespaces into the `schema` part:
```rust
// This won't compile as is, but demonstrates the idea
_ => {
// Concatenate all of the middle identifiers with a dot and put
into the "schema" field
let table = taker.take(enable_normalization);
let schema = idents
.iter()
.skip(1)
.take(idents.len() - 2)
.map(|ident| {
IdentNormalizer::new(enable_normalization).normalize(ident.clone())
})
.collect::<Vec<_>>()
.join(".");
let catalog = taker.take(enable_normalization);
Ok(TableReference::full(catalog, schema, table))
}
```
That would allow us to do `SELECT * FROM one.two.three.four.five` and have
it be converted into a DF TableReference with:
`catalog`: one
`schema`: two.three.four
`table`: five
And any catalog provider implementations can know to create "schemas" to
match this format to support arbitrarily nested namespaces.
### Describe alternatives you've considered
I've considered in my implementation when integrating with a catalog
provider to concatenate the middle namespaces with a `_` in the schema part so
that `SELECT * FROM one.two.three.four.five` would become `SELECT * FROM
one.two_three_four.five`.
I've also considered concatenating the middle namspaces with `.` and not
changing DataFusion, but that would require doing `SELECT * FROM
one."two.three.four".five` which is also not an ideal UX.
Another alternative if we didn't want this to be the default, is to allow
users to customize the behavior of the `object_name_to_table_reference`
function somehow.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]