Re: [PR] feat: Support reading CSV files with inconsistent column counts [datafusion]

via GitHub Thu, 02 Oct 2025 19:52:58 -0700


Jefffrey commented on PR #17553:
URL: https://github.com/apache/datafusion/pull/17553#issuecomment-3363962427


   > Hi @alamb — I just wanted to check if you had a chance to see my comment 
here:
   > 
   > > > I tried the reproducer from #17516 and it still fails on this PR:
   > > > Maybe I don't understand how to use it 🤔
   > > > ```sql
   > > > > create external table foo stored as csv location '/Users/andrewlamb/Downloads/services' options ('truncated_rows' true);
   > > > 0 row(s) fetched.
   > > > Elapsed 0.021 seconds.
   > > > 
   > > > > select * from foo limit 10;
   > > > Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20
   > > > ```
   > > > 
   > > > It also errors if I just try to read the directory directly:
   > > > ```sql
   > > > (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --bin datafusion-cli
   > > >     Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.20s
   > > >      Running `target/debug/datafusion-cli`
   > > > DataFusion CLI v50.0.0
   > > > > select * from '/Users/andrewlamb/Downloads/services' limit 10;
   > > > Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20
   > > > ```
   > > > 
   > > > This PR seems like a step in the right direction to me, it just 
doesn't seem to fix the problem entirely
   > > > It sounds like (as follow on issues / PRs) we probably would need to:
   > > > 
   > > > 1. Enable schema merging for CSV by default
   > > > 2. Implement schema merge using column names (not positions) which is 
how parquet works, and I think what users would expect.
   > > 
   > > 
   > > Hi Andrew! I've reproduced the exact scenario and found the issue. The 
PR is working correctly for external tables, but there's a subtle distinction:
   > > What Works ✅
   > > CREATE EXTERNAL TABLE foo STORED AS CSV LOCATION '/Users/andrewlamb/Downloads/services' OPTIONS ('truncated_rows' 'true'); -- Note: 'true' in quotes
   > > SELECT * FROM foo LIMIT 10;
   > > What Still Fails ❌
   > > SELECT * FROM '/Users/andrewlamb/Downloads/services' LIMIT 10;
   > > The Issue
   > > You were getting the error because:
   > > 
   > > 1. Direct file path queries (SELECT * FROM '/path') don't support 
CSV-specific options like
   > >    truncated_rows - this is a separate limitation not addressed by this 
PR
   > > 2. Option syntax: Make sure to use 'truncated_rows' 'true' (with quotes 
around true) not
   > >    'truncated_rows' true
   > > 
   > > Testing
   > > I created files with exactly 17 vs 20 columns and confirmed:
   > > 
   > > * ✅ External table with OPTIONS ('truncated_rows' 'true') works 
perfectly - merges schemas and
   > >   fills missing columns with NULL
   > > * ❌ Direct path queries still fail with the same error you saw
   > > 
   > > Summary
   > > This PR does fix the core issue - CSV schema merging with different 
column counts works via external tables. The remaining limitation is that 
direct file path queries don't yet support format-specific options.
   > > If you think direct path query support is important for this PR, I'm 
happy to investigate adding that functionality here - it would involve 
enhancing how DataFusion handles table resolution for file paths to pass 
through CSV-specific options. Otherwise, try the external table approach with 
proper option syntax and it should work!
   > 
   > Whenever you get a moment, I would really appreciate your thoughts. 
Thanks! 🙏
   
   The quotes around `true` didn't fix it. The `limit 10` on your query
   stopped it from reading multiple files, which is why you didn't hit the
   error. I don't think the quotes around `true` matter, e.g.
   
   ```sh
   Downloads$ cat truncate.csv
   a,b,c
   1,2,3
   1,2
   1
   1,2,3
   ```
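
   For anyone reproducing this, a quick way to see which lines are short is to count the fields on each line. This is only a sketch: it recreates the file above in a scratch directory (the `mktemp` location is my own choice) and splits on commas naively, which is fine here because nothing is quoted.

   ```sh
   # Recreate the sample file in a scratch directory (location is arbitrary).
   dir=$(mktemp -d)
   printf 'a,b,c\n1,2,3\n1,2\n1\n1,2,3\n' > "$dir/truncate.csv"

   # Count comma-separated fields on each line and join the counts with spaces.
   # Naive split: does not handle quoted fields that contain commas.
   counts=$(awk -F, '{ print NF }' "$dir/truncate.csv" | paste -sd' ' -)
   echo "$counts"
   ```

   Lines 3 and 4 are the ones with fewer than three fields.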
   
   In CLI:
   
   ```sql
   > select * from '/Users/jeffrey/Downloads/truncate.csv';
   Error when processing CSV file Users/jeffrey/Downloads/truncate.csv
   caused by
   Arrow error: Csv error: Encountered unequal lengths between records on CSV file. Expected 3 records, found 2 records at line 3
   > create or replace external table foo stored as csv location '/Users/jeffrey/Downloads/truncate.csv' options ('truncated_rows' 'true');
   0 row(s) fetched.
   Elapsed 0.006 seconds.
   
   > select * from foo;
   +---+------+------+
   | a | b    | c    |
   +---+------+------+
   | 1 | 2    | 3    |
   | 1 | 2    | NULL |
   | 1 | NULL | NULL |
   | 1 | 2    | 3    |
   +---+------+------+
   4 row(s) fetched.
   Elapsed 0.009 seconds.
   
   > create or replace external table foo stored as csv location '/Users/jeffrey/Downloads/truncate.csv' options ('truncated_rows' true);
   0 row(s) fetched.
   Elapsed 0.006 seconds.
   
   > select * from foo;
   +---+------+------+
   | a | b    | c    |
   +---+------+------+
   | 1 | 2    | 3    |
   | 1 | 2    | NULL |
   | 1 | NULL | NULL |
   | 1 | 2    | 3    |
   +---+------+------+
   4 row(s) fetched.
   Elapsed 0.008 seconds.
   ```
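
   To exercise the multi-file case rather than a single file, a directory along these lines might be a useful starting point. It is a sketch only: the file names, column names, and scratch location are made up, and I have not checked it against the files from the original report.

   ```sh
   # Scratch directory with two CSV files whose headers differ in width.
   # File and column names are hypothetical, chosen for illustration.
   dir=$(mktemp -d)
   printf 'a,b,c\n1,2,3\n' > "$dir/narrow.csv"
   printf 'a,b,c,d,e\n1,2,3,4,5\n' > "$dir/wide.csv"

   # Print the header width of each file so the mismatch is visible.
   for f in "$dir"/*.csv; do
       printf '%s: %s columns\n' "$(basename "$f")" "$(head -n 1 "$f" | awk -F, '{ print NF }')"
   done
   ```

   Pointing `create external table ... location` at that directory and running `select * from foo` with no `limit` should make the query read both files, which is the situation the `limit 10` test avoided.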


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

