jorgecarleitao commented on pull request #8710:
URL: https://github.com/apache/arrow/pull/8710#issuecomment-731554302


   * Converting `CSV -> StringArray -> [Type]Array` is not recommended, as it forces us to load everything into memory, even when shorter representations exist. Therefore, we really need a way to build arrays directly out of CSV columns.
   
   * CSV is parsed as rows, but Arrow is column-based. Therefore, there will need to be a pivot of the data at some point (see the sketch after this list).
   
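   To make the pivot concrete, here is a minimal sketch (the helper name `build_f64_column` is hypothetical) that parses one column's cells directly into a typed array, without materializing an intermediate `StringArray`:
   
   ```rust
   use arrow::array::Float64Array;
   use csv::StringRecord;
   
   /// Hypothetical helper: pivot one CSV column (stored row-wise) into a typed
   /// Arrow array, parsing each cell directly instead of going through a
   /// StringArray. Unparsable or missing cells become nulls here, as one
   /// possible policy.
   fn build_f64_column(rows: &[StringRecord], col_idx: usize) -> Float64Array {
       rows.iter()
           .map(|row| row.get(col_idx).and_then(|s| s.parse::<f64>().ok()))
           .collect()
   }
   ```
   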
   My feeling is that there are wildly different conventions out there for how a CSV column should be converted into an Array. IMO we should not try to solve all those use-cases ourselves; instead, we should give users the freedom to choose, while also offering common utilities.
   
   As such, one idea is to offer a plugin mechanism that allows users to parse a CSV column into a `[Type]Array`, together with a default implementation.
   
   Since these are stateless, one simple idea is to have the CSV reader accept a trait with two functions:
   
   ```rust
   // or something like this; the trait name is illustrative
   trait ColumnParser {
       fn infer(&self, rows: &[StringRecord], col_idx: usize) -> DataType;
       fn convert(&self, data_type: &DataType, rows: &[StringRecord], col_idx: usize) -> Result<ArrayRef>;
   }
   ```
   
   This signature indicates that:
   
   1. The function traverses rows
   2. The function is fallible
   3. The resulting array is dynamic
   
   This allows the user to e.g. mark unparsable values as nulls, adopt specific notations for CSV files that are (for them) interoperable with Arrow, etc.
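   
   For illustration, a minimal implementation of the trait sketched above could look like the following (all names are hypothetical; the trait definition is repeated so the example is self-contained). It always infers `Float64` and maps unparsable cells to nulls:
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::{ArrayRef, Float64Array};
   use arrow::datatypes::DataType;
   use arrow::error::Result;
   use csv::StringRecord;
   
   // The trait sketched above, repeated here for self-containedness.
   trait ColumnParser {
       fn infer(&self, rows: &[StringRecord], col_idx: usize) -> DataType;
       fn convert(&self, data_type: &DataType, rows: &[StringRecord], col_idx: usize) -> Result<ArrayRef>;
   }
   
   /// Lenient parser: always infers Float64; unparsable cells become nulls.
   struct LenientF64Parser;
   
   impl ColumnParser for LenientF64Parser {
       fn infer(&self, _rows: &[StringRecord], _col_idx: usize) -> DataType {
           DataType::Float64
       }
   
       fn convert(&self, _data_type: &DataType, rows: &[StringRecord], col_idx: usize) -> Result<ArrayRef> {
           let array: Float64Array = rows
               .iter()
               .map(|row| row.get(col_idx).and_then(|s| s.parse::<f64>().ok()))
               .collect();
           Ok(Arc::new(array))
       }
   }
   ```
   
   A stricter parser could instead return an `Err` from `convert` on the first bad cell, which is exactly why the signature is fallible.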
   

