sergiimk opened a new issue, #7460:
URL: https://github.com/apache/arrow-datafusion/issues/7460

   ### Describe the bug
   
   SQL is a case-insensitive language, but the way case-insensitivity is
implemented for column identifiers in DataFusion often leads to non-intuitive
behavior.
   
   ### To Reproduce
   
   Create `test.json`:
   ```json
   {"A": "a_upper", "B": "b_upper", "b": "b_lower"}
   ```
   Run `datafusion-cli`:
   ```sh
   ❯ create external table test stored as json location 'test.json';
   
   ❯ select * from test;
   +---------+---------+---------+
   | A       | B       | b       |
   +---------+---------+---------+
   | a_upper | b_upper | b_lower |
   +---------+---------+---------+
   
   ❯ select A from test;
   Schema error: No field named a. Valid fields are test."A", test."B", test.b.
   
   ❯ select b from test;
   +---------+
   | b       |
   +---------+
   | b_lower |
   +---------+
   
   ❯ select "B" from test;
   +---------+
   | B       |
   +---------+
   | b_upper |
   +---------+
   ```
   
   ### Expected behavior
   
   In Spark:
   - When I load JSON like `{"A": "a_upper"}`, both `select a from test` and
`select A from test` return the "A" column.
   - When I save to Parquet, the schema preserves the column's original
case.
   - Loading `{"A": "a_upper", "B": "b_upper", "b": "b_lower"}`, however, fails
with a duplicate column error (a point in DataFusion's favor).
   
   What I would expect as a user:
   - `a` and `A` return the "A" column
   - `b` and `B` fail as ambiguous
   - `"B"` and `"b"` work, returning the upper- and lower-case columns respectively

   In other words, case sensitivity should matter only when there is an
ambiguity, which can then be resolved with quoted identifiers.
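
To make the rule above concrete, here is a minimal sketch of the lookup I have in mind, written in Python for brevity. The names `resolve_column` and `AmbiguousColumn` are made up for illustration and are not DataFusion APIs; the real planner would obviously look different.

```python
from typing import List


class AmbiguousColumn(Exception):
    """Raised when an unquoted identifier matches several fields (hypothetical)."""


def resolve_column(identifier: str, quoted: bool, schema: List[str]) -> str:
    """Resolve a column reference against the schema's field names.

    Illustrative only: quoted identifiers match exactly, unquoted ones
    match case-insensitively and fail when more than one field matches.
    """
    if quoted:
        # Quoted identifiers are compared verbatim, case included.
        if identifier in schema:
            return identifier
        raise KeyError(f"No field named {identifier!r}")

    # Unquoted identifiers: collect every field that differs only by case.
    matches = [field for field in schema if field.lower() == identifier.lower()]
    if not matches:
        raise KeyError(f"No field named {identifier!r}")
    if len(matches) > 1:
        raise AmbiguousColumn(
            f"{identifier!r} is ambiguous, candidates: {matches}"
        )
    return matches[0]
```

With the schema from `test.json` (`["A", "B", "b"]`), unquoted `a` and `A` both resolve to `A`, unquoted `b` and `B` raise the ambiguity error, and quoted `"B"` and `"b"` resolve to their exact fields.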
   
   ### Additional context
   
   Currently, after switching our ingest from Spark to DataFusion, our
preprocessing code is full of `"X" as x, "Y" as y` just to restore
case-insensitivity for downstream queries.
   
   I don't mind adding an automatic step that lower-cases all columns where
there is no ambiguity, but I was wondering whether the core behavior should be
adjusted instead.
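
For reference, the automatic step could look roughly like the sketch below. It builds the projection list from the field names and leaves colliding columns untouched. `lowercase_projection` is a hypothetical helper on our side, not something DataFusion provides, and the alias is quoted so that unusual column names stay valid.

```python
from collections import Counter
from typing import List


def lowercase_projection(schema: List[str]) -> str:
    """Build a SELECT list that lower-cases columns where it is unambiguous.

    Hypothetical preprocessing helper: columns whose lower-cased name
    collides with another column are passed through unchanged.
    """
    # Count how many fields share each lower-cased name.
    counts = Counter(name.lower() for name in schema)

    items = []
    for name in schema:
        # Double any embedded quotes so the identifier stays valid SQL.
        quoted = '"' + name.replace('"', '""') + '"'
        lowered = name.lower()
        if counts[lowered] == 1 and lowered != name:
            alias = '"' + lowered.replace('"', '""') + '"'
            items.append(f"{quoted} as {alias}")
        else:
            items.append(quoted)
    return ", ".join(items)
```

For the `test.json` schema this yields `"A" as "a", "B", "b"`, which can be spliced into a `select ... from test` statement.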


