sergiimk opened a new issue, #7460:
URL: https://github.com/apache/arrow-datafusion/issues/7460
### Describe the bug
SQL is a case-insensitive language, but the way case-insensitivity is
implemented for column identifiers in DataFusion often leads to non-intuitive
behavior:
### To Reproduce
Create `test.json`:
```json
{"A": "a_upper", "B": "b_upper", "b": "b_lower"}
```
Run `datafusion-cli`:
```sh
❯ create external table test stored as json location 'test.json';
❯ select * from test;
+---------+---------+---------+
| A | B | b |
+---------+---------+---------+
| a_upper | b_upper | b_lower |
+---------+---------+---------+
❯ select A from test;
Schema error: No field named a. Valid fields are test."A", test."B", test.b.
❯ select b from test;
+---------+
| b |
+---------+
| b_lower |
+---------+
❯ select "B" from test;
+---------+
| B |
+---------+
| b_upper |
+---------+
```
### Expected behavior
In Spark:
- When I load json like `{"A": "a_upper"}` both `select a from test` and
`select A from test` will return the "A" column.
- When I save to Parquet - the schema will preserve the column's original
case
- Loading `{"A": "a_upper", "B": "b_upper", "b": "b_lower"}`, however, fails
with a duplicate column error (a point in DataFusion's favor).
What I would expect as a user:
- `a` and `A` return "A" column
- `b` and `B` fail as ambiguous
- `"B"` and `"b"` work, returning upper and lower case columns respectively
i.e. case sensitivity should matter only when there is an ambiguity, and can
be resolved with quoted identifiers.
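
The expected rules above can be sketched as a small resolver. This is only an illustration of the proposed semantics, not DataFusion's actual API: `Resolution` and `resolve_column` are hypothetical names, and the schema is simplified to a list of field names.

```rust
/// Outcome of resolving a column reference against a schema.
#[derive(Debug, PartialEq)]
pub enum Resolution<'a> {
    /// Exactly one field matched.
    Found(&'a str),
    /// No field matched.
    NotFound,
    /// Several fields differ only by case; a quoted identifier is needed.
    Ambiguous(Vec<&'a str>),
}

/// Hypothetical resolver (not part of DataFusion).
/// `quoted` is true for identifiers written as "B" and false for a bare B.
pub fn resolve_column<'a>(fields: &[&'a str], ident: &str, quoted: bool) -> Resolution<'a> {
    if quoted {
        // Quoted identifiers must match a field name exactly.
        return match fields.iter().copied().find(|f| *f == ident) {
            Some(f) => Resolution::Found(f),
            None => Resolution::NotFound,
        };
    }

    // Unquoted identifiers match case-insensitively.
    let wanted = ident.to_lowercase();
    let matches: Vec<&'a str> = fields
        .iter()
        .copied()
        .filter(|f| f.to_lowercase() == wanted)
        .collect();

    match matches.len() {
        0 => Resolution::NotFound,
        1 => Resolution::Found(matches[0]),
        // More than one candidate: report the ambiguity instead of guessing.
        _ => Resolution::Ambiguous(matches),
    }
}
```

With the fields `A`, `B`, `b` from the reproduction, unquoted `a` and `A` both resolve to `A`, unquoted `b` and `B` are reported as ambiguous, and quoted `"B"` and `"b"` resolve to their exact columns.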
### Additional context
Currently, after switching our ingest from Spark to DataFusion, our
preprocessing code is full of `"X" as x, "Y" as y` just to restore
case-insensitivity for downstream queries.
I don't mind adding an automatic step that lower-cases all columns where
there is no ambiguity, but was wondering if core behavior should be adjusted
instead.
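
For reference, the automatic lower-casing step could look roughly like the sketch below. `lowercase_projection` is a hypothetical helper, not something DataFusion provides. It assumes the column names are already known and only builds the text of a select list. Aliases are quoted here so that names containing special characters stay valid.

```rust
use std::collections::HashMap;

/// Hypothetical helper: builds a select list that aliases each column to its
/// lower-case name when no other column shares that lower-case name.
/// Columns that would collide keep their original, quoted name.
pub fn lowercase_projection(fields: &[&str]) -> String {
    // Count how many columns map to each lower-cased name.
    let mut counts: HashMap<String, usize> = HashMap::new();
    for f in fields {
        *counts.entry(f.to_lowercase()).or_insert(0) += 1;
    }

    fields
        .iter()
        .map(|f| {
            let lower = f.to_lowercase();
            // Escape embedded double quotes by doubling them.
            let quoted = format!("\"{}\"", f.replace('"', "\"\""));
            if counts[&lower] == 1 && lower != *f {
                let alias = format!("\"{}\"", lower.replace('"', "\"\""));
                format!("{} as {}", quoted, alias)
            } else {
                // Already lower-case, or ambiguous: leave the name untouched.
                quoted
            }
        })
        .collect::<Vec<_>>()
        .join(", ")
}
```

For the columns `A`, `B`, `b` this aliases only `A`, leaving `B` and `b` to be addressed with quoted identifiers.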