Posnet opened a new issue, #5678: URL: https://github.com/apache/arrow-rs/issues/5678
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Add the ability to parse CSV files that have a flexible number of columns. Specifically, a large subset of CSVs omit trailing columns from some rows and expect the missing values to be treated as null.

**Describe the solution you'd like**

An option, configurable via the format or the reader builder, to enable flexible columns.

**Describe alternatives you've considered**

I have tried Python and Rust solutions. pandas and Polars work for the general case, but both have poor support for streaming reads of CSV into Arrow buffers: they require either memory-mapped files or buffering most of the file in memory, unlike the convenient build/build_buffered methods offered by arrow-csv. The Rust csv crate is excellent, but it is limited to row-at-a-time parsing, and in basic testing I've done arrow-csv outperforms it when loading large datasets into Arrow buffers.

**Additional context**

Examples from other implementations:

Rust csv: https://docs.rs/csv/latest/csv/struct.ReaderBuilder.html#method.flexible

pandas: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv The pandas documentation doesn't specify, but testing shows it allows missing trailing columns by default.

Polars: https://docs.pola.rs/py-polars/html/reference/api/polars.read_csv.html Polars behaves the same as pandas and Rust csv.

One conflict, however, is that the arrow-cpp CSV parser, like the current arrow-csv, doesn't allow ragged/flexible columns.
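To make the requested semantics concrete, here is a minimal sketch of "missing trailing columns become null" for a single row. It is not arrow-csv's API: `parse_flexible_row` is a hypothetical helper written only for illustration, and it assumes plain comma-separated fields with no quoting or escaping.

```rust
/// Hypothetical helper (not part of arrow-csv): split one CSV line into
/// exactly `num_columns` optional fields, padding a short row with `None`.
/// Assumption: fields contain no quotes or embedded commas.
fn parse_flexible_row(line: &str, num_columns: usize) -> Result<Vec<Option<String>>, String> {
    // Every field present in the line becomes `Some(value)`.
    let mut fields: Vec<Option<String>> = line
        .split(',')
        .map(|field| Some(field.to_string()))
        .collect();

    // Rows with too many fields are still rejected; only missing
    // trailing columns are tolerated.
    if fields.len() > num_columns {
        return Err(format!(
            "expected at most {} fields, found {}",
            num_columns,
            fields.len()
        ));
    }

    // Pad missing trailing columns with `None`, standing in for null.
    fields.resize(num_columns, None);
    Ok(fields)
}
```

With a three-column schema, `parse_flexible_row("4,5", 3)` yields two `Some` values followed by `None`, while a four-field row returns an error. A real implementation inside arrow-csv would presumably set validity bits in the column builders rather than materialize `Option<String>` values per row.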
