Dandandan edited a comment on pull request #9090:
URL: https://github.com/apache/arrow/pull/9090#issuecomment-754102662


   @jorgecarleitao note that the `csv` `StringRecord` also verifies that 
strings are valid UTF-8. That adds a bit of overhead, but the UTF-8 check 
itself is cheap for now; most of the overhead comes from the logic 
surrounding `StringRecord`.
   I think eventually we could use a `StringArray` or `BinaryArray` as the 
buffer so we can remove the `StringRecord`s, each of which is internally a 
`Vec<u8>` (via `ByteRecord`) and a `Vec<usize>` for the field bounds.
   
   The performance penalty of this branch relative to master is currently 
~10%, as we introduce an extra intermediate step. I think this could be more 
than compensated for by removing the `StringRecord` abstraction and writing 
directly to a string or binary array without intermediate steps.
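   
   As a rough sketch of what "writing directly" could look like: each parsed 
field is validated once and appended to a single values buffer plus an offsets 
vector, which is the general shape of a string array. `StringColumnBuilder` and 
its methods are hypothetical names for illustration, not the `arrow` crate's 
API, and the `i32` offsets are an assumption about the target layout.
   
   ```rust
   use std::str;

   /// Hypothetical column builder (illustration only, not the arrow API).
   /// All values live in one contiguous buffer; `offsets[i]..offsets[i + 1]`
   /// is the byte range of value `i`.
   #[derive(Debug)]
   struct StringColumnBuilder {
       values: Vec<u8>,
       offsets: Vec<i32>,
   }

   impl StringColumnBuilder {
       fn new() -> Self {
           // The offsets vector always starts with 0.
           StringColumnBuilder { values: Vec::new(), offsets: vec![0] }
       }

       /// Validate a raw field as UTF-8 and append it without building an
       /// intermediate per-row record. Nothing is written if validation fails.
       fn append(&mut self, field: &[u8]) -> Result<(), str::Utf8Error> {
           str::from_utf8(field)?;
           self.values.extend_from_slice(field);
           // Sketch only: a real builder must handle buffers over i32::MAX bytes.
           self.offsets.push(self.values.len() as i32);
           Ok(())
       }

       /// Number of values appended so far.
       fn len(&self) -> usize {
           self.offsets.len() - 1
       }

       /// Borrow value `i` as a `&str`; panics if `i` is out of range.
       fn value(&self, i: usize) -> &str {
           let start = self.offsets[i] as usize;
           let end = self.offsets[i + 1] as usize;
           str::from_utf8(&self.values[start..end]).expect("validated on append")
       }
   }
   ```
   
   With this shape the UTF-8 check still happens once per field, but there is 
no per-row allocation or copy between the parser output and the column buffer.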
   
   This is the structure the csv crate is using per row:
   
   ```rust
   struct ByteRecordInner {
       /// The position of this byte record.
       pos: Option<Position>,
       /// All fields in this record, stored contiguously.
       fields: Vec<u8>,
       /// The number of and location of each field in this record.
       bounds: Bounds,
   }
   ```
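   
   To make the per-row layout concrete, here is a simplified stand-in showing 
how a contiguous `fields` buffer plus field end offsets gives access to 
individual fields. `FlatRecord` is a hypothetical type for illustration; it 
assumes `Bounds` boils down to a list of end offsets and omits `pos`, so it is 
not the csv crate's actual code.
   
   ```rust
   /// Hypothetical, simplified per-row record: all field bytes stored
   /// back to back, plus the end offset of each field.
   #[derive(Debug, Default)]
   struct FlatRecord {
       fields: Vec<u8>,
       ends: Vec<usize>,
   }

   impl FlatRecord {
       /// Append one field's bytes and record where it ends.
       fn push_field(&mut self, field: &[u8]) {
           self.fields.extend_from_slice(field);
           self.ends.push(self.fields.len());
       }

       /// Number of fields in this record.
       fn len(&self) -> usize {
           self.ends.len()
       }

       /// Borrow field `i`, or `None` if `i` is out of range.
       fn get(&self, i: usize) -> Option<&[u8]> {
           let end = *self.ends.get(i)?;
           // A field starts where the previous one ended (0 for the first).
           let start = if i == 0 { 0 } else { self.ends[i - 1] };
           Some(&self.fields[start..end])
       }
   }
   ```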
   

