[I] Avro reader fails when query columns are reordered in SELECT statement [datafusion]

via GitHub Thu, 24 Apr 2025 02:42:43 -0700


nantunes opened a new issue, #15839:
URL: https://github.com/apache/datafusion/issues/15839


   ### Describe the bug
   
   When querying an Avro file table in DataFusion, column selection works when 
the SELECT list follows schema order, whether it names all columns or only a 
subset of them. However, if the column order in the SELECT statement differs 
from the original schema order, the query fails with a type mismatch error.
   This happens because the current Avro reader implementation doesn't properly 
respect the ordering of columns specified in the projection when creating the 
RecordBatch. The reader creates arrays correctly but doesn't match them with 
the expected schema ordering.
   
   ### To Reproduce
   
   
   1. Create an Avro file with multiple columns of different types (e.g., 
username: string, tweet: string, timestamp: int64)
   2. Register it as a table in DataFusion
   3. Try different query patterns:
   
   ```
   // This works (all columns in original order)
   SELECT * FROM avro_file1
   +------------+-------------------------------------+------------+
   | username   | tweet                               | timestamp  |
   +------------+-------------------------------------+------------+
   | miguno     | Rock: Nerf paper, scissors is fine. | 1366150681 |
   | BlizzardCS | Works as intended.  Terran is IMBA. | 1366154481 |
   +------------+-------------------------------------+------------+
   
   // This works (subset of columns in original order)
   SELECT username, timestamp FROM avro_file1
   +------------+------------+
   | username   | timestamp  |
   +------------+------------+
   | miguno     | 1366150681 |
   | BlizzardCS | 1366154481 |
   +------------+------------+
   
   // This fails (reordered columns)
   SELECT timestamp, username FROM avro_file1
   ❌ column types must match schema types, expected Int64 but found Utf8 at 
column index 0
   ```
   
   ### Expected behavior
   
   All three queries should work correctly. The third query should return the 
columns in the order specified in the SELECT statement:
   
   ```
   +------------+------------+
   | timestamp  | username   |
   +------------+------------+
   | 1366150681 | miguno     |
   | 1366154481 | BlizzardCS |
   +------------+------------+
   ```
   
   ### Additional context
   
   The issue is in the Avro reader implementation, specifically in how it 
handles projections. When columns are reordered in the query, the reader 
creates arrays in the original schema order but the output schema expects them 
in the reordered sequence, leading to a type mismatch.
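
The mismatch described above can be illustrated with a small std-only Rust model. This is a simplified sketch, not DataFusion's or Arrow's actual code: `Column`, `DataType`, `validate`, `project_in_file_order` and `project_in_requested_order` are hypothetical names introduced for illustration, and the "buggy" function only assumes that the reader emits projected columns in file-schema order, as the issue suggests.

```rust
// Simplified stand-in for an Arrow array: only the variants this example needs.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Utf8(Vec<String>),
    Int64(Vec<i64>),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    Int64,
}

impl Column {
    fn data_type(&self) -> DataType {
        match self {
            Column::Utf8(_) => DataType::Utf8,
            Column::Int64(_) => DataType::Int64,
        }
    }
}

/// Models the batch-construction check: the column at each index must have
/// the type the output schema declares for that index.
fn validate(schema: &[DataType], columns: &[Column]) -> Result<(), String> {
    for (i, (expected, col)) in schema.iter().zip(columns).enumerate() {
        let found = col.data_type();
        if *expected != found {
            return Err(format!(
                "column types must match schema types, expected {:?} but found {:?} at column index {}",
                expected, found, i
            ));
        }
    }
    Ok(())
}

/// Assumed buggy behaviour: keeps the projected columns but emits them in
/// file-schema order, ignoring the order requested by the projection.
fn project_in_file_order(file: &[Column], projection: &[usize]) -> Vec<Column> {
    let mut sorted = projection.to_vec();
    sorted.sort_unstable();
    sorted.iter().map(|&i| file[i].clone()).collect()
}

/// Expected behaviour: emits columns in exactly the order the projection lists them.
fn project_in_requested_order(file: &[Column], projection: &[usize]) -> Vec<Column> {
    projection.iter().map(|&i| file[i].clone()).collect()
}

fn main() {
    // File schema order: username, tweet, timestamp.
    let file = vec![
        Column::Utf8(vec!["miguno".to_string(), "BlizzardCS".to_string()]),
        Column::Utf8(vec![
            "Rock: Nerf paper, scissors is fine.".to_string(),
            "Works as intended.  Terran is IMBA.".to_string(),
        ]),
        Column::Int64(vec![1366150681, 1366154481]),
    ];

    // SELECT timestamp, username -> projection indices [2, 0].
    let projection = [2, 0];
    let output_schema = [DataType::Int64, DataType::Utf8];

    let buggy = project_in_file_order(&file, &projection);
    println!("file order:      {:?}", validate(&output_schema, &buggy));

    let fixed = project_in_requested_order(&file, &projection);
    println!("requested order: {:?}", validate(&output_schema, &fixed));
}
```

In this model, a projection that is already sorted (such as `[0, 2]` for `SELECT username, timestamp`) passes either way, which is consistent with the first two queries succeeding and only the reordered one failing.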
   
   This issue only affects the Avro reader - other formats like Parquet and CSV 
seem to handle column reordering correctly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


