[I] Optionally support flexible column lengths [arrow-rs]

via GitHub Mon, 22 Apr 2024 08:31:08 -0700


Posnet opened a new issue, #5678:
URL: https://github.com/apache/arrow-rs/issues/5678


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Add the ability to parse CSV files whose rows have a variable number of 
columns. Specifically, many CSV files omit trailing columns from some rows and 
expect the missing values to be treated as null. 
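   
   To make the requested semantics concrete, here is a minimal sketch using 
only the Rust standard library. It is not arrow-csv code: the helper names 
`parse_flexible_row` and `parse_flexible` are hypothetical, and the sketch 
assumes plain comma-separated input with no quoting or escaping. Rows shorter 
than the expected width are padded with nulls (`None`); rows that are too long 
are still rejected.

```rust
/// Split one line on commas and pad it to `num_columns` fields.
/// Fields missing from the end of the row become `None` (null).
/// Hypothetical helper for illustration; no quoting or escaping support.
fn parse_flexible_row(
    line: &str,
    num_columns: usize,
) -> Result<Vec<Option<String>>, String> {
    let mut fields: Vec<Option<String>> = line
        .split(',')
        .map(|field| Some(field.to_string()))
        .collect();

    // Only missing trailing columns are tolerated; extra columns are an error.
    if fields.len() > num_columns {
        return Err(format!(
            "expected at most {} fields, found {}",
            num_columns,
            fields.len()
        ));
    }

    // Pad short rows with nulls up to the expected width.
    fields.resize(num_columns, None);
    Ok(fields)
}

/// Apply `parse_flexible_row` to every line of `input`.
fn parse_flexible(
    input: &str,
    num_columns: usize,
) -> Result<Vec<Vec<Option<String>>>, String> {
    input
        .lines()
        .map(|line| parse_flexible_row(line, num_columns))
        .collect()
}
```

   With three expected columns, the row `1,2` would come back as two values 
followed by one null, while `1,2,3,4` would be reported as an error.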
   
   **Describe the solution you'd like**
   An option, configurable via the format or reader builder, to enable 
flexible columns.
   
   **Describe alternatives you've considered**
   I have tried Python and Rust solutions. While pandas and Polars handle the 
general case, both have poor support for streaming reads of CSV into Arrow 
buffers: they require either memory-mapped files or buffering most of the file 
in memory, unlike the convenient build/build_buffered methods offered by 
arrow-csv. The Rust csv crate is excellent, but it is limited to row-at-a-time 
parsing, and in my basic testing arrow-csv outperforms it when loading large 
datasets into Arrow buffers. 
   
   **Additional context**
   Example from other implementations:
   
   Rust csv
   https://docs.rs/csv/latest/csv/struct.ReaderBuilder.html#method.flexible
   
   
   
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv
   
   The pandas documentation doesn't specify this, but testing shows that it 
allows missing trailing columns by default.
   
   Similarly, Polars behaves the same as pandas and Rust csv.
   
   https://docs.pola.rs/py-polars/html/reference/api/polars.read_csv.html
   
   One conflict, however, is that the arrow-cpp CSV parser, like the current 
arrow-csv, doesn't allow ragged/flexible columns.

