alamb opened a new issue, #17517:
URL: https://github.com/apache/datafusion/issues/17517

   ### Describe the bug
   
   I was experimenting with the DataFusion CSV parser using the example 
from https://duckdb.org/2025/09/08/duckdb-on-the-framework-laptop-13, but 
DataFusion failed to load the data after I converted it to Parquet.
   
   ### To Reproduce
   
   Get the data
   ```shell
   wget https://blobs.duckdb.org/nl-railway/railway-services-80-months.zip
   unzip railway-services-80-months.zip
   ```
   
   Then run 
   ```shell
   mkdir services-parquet
   datafusion-cli
   ```
   
   Convert each file to parquet:
   
   ```sql
   COPY 'services/services-2019.csv' TO 
'services-parquet/services-2019.parquet';
   COPY 'services/services-2020.csv' TO 
'services-parquet/services-2020.parquet';
   COPY 'services/services-2021.csv' TO 
'services-parquet/services-2021.parquet';
   COPY 'services/services-2022.csv' TO 
'services-parquet/services-2022.parquet';
   COPY 'services/services-2023.csv' TO 
'services-parquet/services-2023.parquet';
   COPY 'services/services-2024.csv' TO 
'services-parquet/services-2024.parquet';
   COPY 'services/services-2025-01.csv' TO 
'services-parquet/services-2025-01.parquet';
   COPY 'services/services-2025-02.csv' TO 
'services-parquet/services-2025-02.parquet';
   COPY 'services/services-2025-03.csv' TO 
'services-parquet/services-2025-03.parquet';
   COPY 'services/services-2025-04.csv' TO 
'services-parquet/services-2025-04.parquet';
   COPY 'services/services-2025-05.csv' TO 
'services-parquet/services-2025-05.parquet';
   COPY 'services/services-2025-06.csv' TO 
'services-parquet/services-2025-06.parquet';
   COPY 'services/services-2025-07.csv' TO 
'services-parquet/services-2025-07.parquet';
   COPY 'services/services-2025-08.csv' TO 
'services-parquet/services-2025-08.parquet';
   ```
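
   The statements above follow one pattern, so they could also be generated 
rather than typed by hand. This is only a sketch: the function name 
`gen_copy_statements` and the file name `convert.sql` are made up for 
illustration, and it assumes the CSV files sit in `services/` as unpacked above.

   ```shell
   # Hypothetical helper: print one COPY statement per CSV file in a directory.
   # $1 = directory containing the CSV files, $2 = output directory for Parquet.
   gen_copy_statements() {
     src_dir=$1
     dst_dir=$2
     for f in "$src_dir"/*.csv; do
       # Skip the literal pattern if no CSV files matched
       [ -e "$f" ] || continue
       name=$(basename "$f" .csv)
       printf "COPY '%s' TO '%s/%s.parquet';\n" "$f" "$dst_dir" "$name"
     done
   }

   # Usage (not run here):
   #   gen_copy_statements services services-parquet > convert.sql
   #   datafusion-cli -f convert.sql
   ```

   Deriving each output name from its input name avoids copy-paste mismatches 
between the source and destination file names.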
   
   And then run 
   
   ```sql
   DataFusion CLI v49.0.2
   > select * from 'services-parquet' limit 10;
   Arrow error: Schema error: Fail to merge schema field 'Stop:Arrival time' 
because the from data_type = Timestamp(Second, None) does not equal Utf8
   ```
   
   
   ### Expected behavior
   
   I expect to be able to read the data correctly 
   
   ### Additional context
   
   One problem is that the type of the `Stop:Arrival time` column has been 
inferred differently across files. Sometimes it is a timestamp and 
sometimes a string:
   
   ```sql
   > describe 'services-parquet/services-2020.parquet';
   +------------------------------+-----------+-------------+
   | column_name                  | data_type | is_nullable |
   +------------------------------+-----------+-------------+
   | Service:RDT-ID               | Int64     | YES         |
   | Service:Date                 | Date32    | YES         |
   | Service:Type                 | Utf8View  | YES         |
   | Service:Company              | Utf8View  | YES         |
   | Service:Train number         | Int64     | YES         |
   | Service:Completely cancelled | Boolean   | YES         |
   | Service:Partly cancelled     | Boolean   | YES         |
   | Service:Maximum delay        | Int64     | YES         |
   | Stop:RDT-ID                  | Int64     | YES         |
   | Stop:Station code            | Utf8View  | YES         |
   | Stop:Station name            | Utf8View  | YES         |
   | Stop:Arrival time            | Utf8View  | YES         |
   | Stop:Arrival delay           | Utf8View  | YES         |
   | Stop:Arrival cancelled       | Utf8View  | YES         |
   | Stop:Departure time          | Utf8View  | YES         |
   | Stop:Departure delay         | Utf8View  | YES         |
   | Stop:Departure cancelled     | Utf8View  | YES         |
   +------------------------------+-----------+-------------+
   17 row(s) fetched.
   Elapsed 0.009 seconds.
   
   > describe 'services-parquet/services-2021.parquet';
   +------------------------------+-------------------------+-------------+
   | column_name                  | data_type               | is_nullable |
   +------------------------------+-------------------------+-------------+
   | Service:RDT-ID               | Int64                   | YES         |
   | Service:Date                 | Date32                  | YES         |
   | Service:Type                 | Utf8View                | YES         |
   | Service:Company              | Utf8View                | YES         |
   | Service:Train number         | Int64                   | YES         |
   | Service:Completely cancelled | Boolean                 | YES         |
   | Service:Partly cancelled     | Boolean                 | YES         |
   | Service:Maximum delay        | Int64                   | YES         |
   | Stop:RDT-ID                  | Int64                   | YES         |
   | Stop:Station code            | Utf8View                | YES         |
   | Stop:Station name            | Utf8View                | YES         |
   | Stop:Arrival time            | Timestamp(Second, None) | YES         | <--- Note this field type is different
   | Stop:Arrival delay           | Int64                   | YES         |
   | Stop:Arrival cancelled       | Boolean                 | YES         |
   | Stop:Departure time          | Utf8View                | YES         |
   | Stop:Departure delay         | Utf8View                | YES         |
   | Stop:Departure cancelled     | Utf8View                | YES         |
   +------------------------------+-------------------------+-------------+
   17 row(s) fetched.
   Elapsed 0.008 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

