So, the way things are structured here is similar to how the pandas CSV reader
works. There are a couple of issues with this:

* The conversion hot path involves strided memory access
* Conversions to Arrow varbinary involve an extra conversion
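To illustrate the first point, here is a hypothetical sketch (not the actual reader code; `column_tokens`, `flat_tokens`, and the buffer shape are all assumptions for illustration): a row-by-row tokenizer lands tokens in row-major order, so converting one column means a strided pass over the buffer.

```python
# Hypothetical illustration: tokens from a row-by-row tokenizer are stored
# row-major, so gathering one column strides through the buffer with a
# step of `ncols` between consecutive values of the same column.
def column_tokens(flat_tokens, ncols, col):
    """Gather the tokens for one column from a row-major token buffer."""
    return flat_tokens[col::ncols]

# 3 rows x 2 columns, flattened row-major: [r0c0, r0c1, r1c0, r1c1, ...]
flat = ["1", "a", "2", "b", "3", "c"]
ints = [int(t) for t in column_tokens(flat, ncols=2, col=0)]
```

Each value of a column sits `ncols` slots apart, which is exactly the access pattern that hurts cache locality in the type-conversion hot path.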

After this patch is merged, I would like to experiment with a "columnar" 
tokenization where we construct the Arrow varbinary layout column by column 
as we tokenize the data, so that:

* The type conversion hot path avoids strided memory access, so conversion 
performance should improve
* If the column's type is string, no further copies are necessary
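As a rough sketch of that proposed path (an assumed design, not the merged code; `VarBinaryBuilder` is a hypothetical name), appending each string cell directly into Arrow-style variable-length binary buffers — an offsets array plus one contiguous data buffer — means a string column is already in its final layout:

```python
# Sketch of the "columnar" idea (assumed, for illustration): build the
# Arrow varbinary layout incrementally per column while tokenizing.
class VarBinaryBuilder:
    def __init__(self):
        self.offsets = [0]       # Arrow varbinary stores n + 1 offsets
        self.data = bytearray()  # one contiguous buffer of value bytes

    def append(self, value: bytes):
        # Value i occupies data[offsets[i]:offsets[i + 1]]
        self.data += value
        self.offsets.append(len(self.data))

    def get(self, i: int) -> bytes:
        return bytes(self.data[self.offsets[i]:self.offsets[i + 1]])

b = VarBinaryBuilder()
for cell in (b"foo", b"", b"barbaz"):
    b.append(cell)
```

If the builder's buffers match what Arrow expects, a string column needs no extra copy at the end of parsing, which is the second bullet above.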

I don't know for sure, but I expect this to be a performance win. I can dig 
into this in the coming weeks.

[ Full content available at: https://github.com/apache/arrow/pull/2576 ]