[GitHub] [arrow] jorgecarleitao commented on pull request #9090: ARROW-11123: [Rust] Use cast kernel to simplify csv parser

GitBox Mon, 04 Jan 2021 08:46:00 -0800


jorgecarleitao commented on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754084930



   I'm curious about the perf implications. Even for integers or dates, we will 
always need to verify that the bytes are valid utf8 in order to create valid 
`StringArray` value buffers. We could store them as `BinaryArray` instead. My 
hypothesis is that most of the `to_str / to_datetime / to_int` conversions are 
not SIMDed and are thus unlikely to benefit from Arrow, but it could also be 
that the columnar format helps the compiler.
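
   To make the utf8 point concrete, here is a minimal std-only sketch (not the 
arrow crate's code; both function names are hypothetical). The first function 
validates utf8 before parsing, which is what going through a string buffer 
implies. The second parses ASCII digits straight from the bytes, which is 
roughly what a binary buffer would allow. It says nothing about which is 
faster; that would need a benchmark.

```rust
use std::str;

/// Hypothetical helper: parse a raw CSV field by going through `&str`.
/// The bytes must be validated as utf8 before `str::parse` can be used.
fn parse_i64_via_str(field: &[u8]) -> Option<i64> {
    let text = str::from_utf8(field).ok()?; // utf8 validation happens here
    text.parse::<i64>().ok()
}

/// Hypothetical helper: parse ASCII digits directly from the bytes,
/// with no utf8 validation step. Returns `None` on any non-digit or overflow.
fn parse_i64_from_bytes(field: &[u8]) -> Option<i64> {
    // An optional leading '-' marks a negative number.
    let (negative, digits) = match field {
        [b'-', rest @ ..] => (true, rest),
        _ => (false, field),
    };
    if digits.is_empty() {
        return None;
    }
    let mut value: i64 = 0;
    for &b in digits {
        if !b.is_ascii_digit() {
            return None;
        }
        let d = (b - b'0') as i64;
        value = value.checked_mul(10)?;
        // Accumulate on the negative side for negative inputs so that
        // `i64::MIN` does not overflow.
        value = if negative {
            value.checked_sub(d)?
        } else {
            value.checked_add(d)?
        };
    }
    Some(value)
}
```

   Both return `None` for a field such as `b"1a"`, and only the first one 
touches the utf8 validator.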
   
   Note that people can always pass a `DataType::Utf8` to the schema instead of 
inferring it, and then cast the types themselves. I always understood the 
readers (csv, json, parquet) as ways to bypass that approach and create Arrow 
arrays directly from the format. It happens that CSV is a particularly poor 
format for this, as everything is a (not-necessarily utf8) string with very 
few invariants.
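
   The two approaches can be sketched with plain `Vec`s standing in for Arrow 
arrays (std-only; `Utf8Column` and all three function names are hypothetical, 
and turning unparsable values into nulls is an assumption of this sketch, not 
a statement about the cast kernel):

```rust
/// Stand-in for a column read with `DataType::Utf8`: one optional string per row.
type Utf8Column = Vec<Option<String>>;

/// Step 1 of the two-step approach: keep every field as a string.
fn read_as_utf8(fields: &[&str]) -> Utf8Column {
    fields.iter().map(|s| Some(s.to_string())).collect()
}

/// Step 2 of the two-step approach: the user casts the string column.
/// Values that fail to parse become `None` (null) in this sketch.
fn cast_utf8_to_i64(column: &Utf8Column) -> Vec<Option<i64>> {
    column
        .iter()
        .map(|v| v.as_deref().and_then(|s| s.parse::<i64>().ok()))
        .collect()
}

/// Direct approach: parse each field into the target type while reading,
/// without materializing an intermediate string column.
fn read_i64_directly(fields: &[&str]) -> Vec<Option<i64>> {
    fields.iter().map(|s| s.parse::<i64>().ok()).collect()
}
```

   Both paths give the same values; the difference is the intermediate string 
column that the two-step approach has to allocate and later discard.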
   


