EeshanBembi commented on PR #17553:
URL: https://github.com/apache/datafusion/pull/17553#issuecomment-3363185863
Hi @alamb, I just wanted to check whether you had a chance to see my comment here:
> > I tried the reproducer from #17516 and it still fails on this PR:
> > Maybe I don't understand how to use it 🤔
> > ```sql
> > > create external table foo stored as csv location '/Users/andrewlamb/Downloads/services' options ('truncated_rows' true);
> > 0 row(s) fetched.
> > Elapsed 0.021 seconds.
> >
> > > select * from foo limit 10;
> > Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20
> > ```
> >
> > It also errors if I just try to read the directory directly:
> > ```sql
> > (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --bin datafusion-cli
> > Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.20s
> > Running `target/debug/datafusion-cli`
> > DataFusion CLI v50.0.0
> > > select * from '/Users/andrewlamb/Downloads/services' limit 10;
> > Arrow error: Csv error: incorrect number of fields for line 1, expected 17 got 20
> > ```
> >
> > This PR seems like a step in the right direction to me, it just doesn't seem to fix the problem entirely
> > It sounds like (as follow on issues / PRs) we probably would need to:
> >
> > 1. Enable schema merging for CSV by default (
> > 2. Implement schema merge using column names (not positions) which is how parquet works, and I think what users would expect.
>
> Hi Andrew! I've reproduced the exact scenario and found the issue. The PR is working correctly for external tables, but there's a subtle distinction:
>
> What Works ✅
>
> CREATE EXTERNAL TABLE foo STORED AS CSV LOCATION '/Users/andrewlamb/Downloads/services' OPTIONS ('truncated_rows' 'true'); -- Note: 'true' in quotes
>
> SELECT * FROM foo LIMIT 10;
>
> What Still Fails ❌
>
> SELECT * FROM '/Users/andrewlamb/Downloads/services' LIMIT 10;
>
> The Issue
>
> You were getting the error because:
>
> 1. Direct file path queries (SELECT * FROM '/path') don't support CSV-specific options like truncated_rows - this is a separate limitation not addressed by this PR
> 2. Option syntax: Make sure to use 'truncated_rows' 'true' (with quotes around true) not 'truncated_rows' true
>
> Testing
>
> I created files with exactly 17 vs 20 columns and confirmed:
>
> * ✅ External table with OPTIONS ('truncated_rows' 'true') works perfectly - merges schemas and fills missing columns with NULL
> * ❌ Direct path queries still fail with the same error you saw
>
> Summary
>
> This PR does fix the core issue - CSV schema merging with different column counts works via external tables. The remaining limitation is that direct file path queries don't yet support format-specific options.
>
> If you think direct path query support is important for this PR, I'm happy to investigate adding that functionality here - it would involve enhancing how DataFusion handles table resolution for file paths to pass through CSV-specific options. Otherwise, try the external table approach with proper option syntax and it should work!
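
To make the expected result concrete, here is a minimal plain-Python sketch of the behavior described above: files with different column counts are merged by position, and the missing trailing columns are filled with NULL (`None`). This is only an illustration, not DataFusion's implementation. The helper name `read_padded` and the tiny sample data are hypothetical, and the sketch assumes the narrower file's header is a prefix of the wider one.

```python
import csv
import io


def read_padded(files):
    """Merge CSV texts that may have different column counts.

    The merged header comes from the widest file (positional merge),
    and rows with fewer fields are padded with None, standing in for NULL.
    """
    parsed = [list(csv.reader(io.StringIO(text))) for text in files]
    # Widest header wins; narrower files are assumed to be a prefix of it.
    header = max((rows[0] for rows in parsed if rows), key=len)
    merged = []
    for rows in parsed:
        for row in rows[1:]:  # skip each file's own header line
            merged.append(row + [None] * (len(header) - len(row)))
    return header, merged


# Hypothetical sample data: a 2-column file and a 3-column file.
narrow = "id,name\n1,a\n"
wide = "id,name,region\n2,b,eu\n"
header, rows = read_padded([narrow, wide])
```

A name-based merge, as suggested in the quoted follow-up list, would instead match each row's values to the merged header by column name rather than by position.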
Whenever you get a moment, I would really appreciate your thoughts. Thanks! 🙏
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]