The way things are structured here is similar to how the pandas CSV reader works. There are a couple of issues with this:
* The conversion hot path involves strided memory access
* Conversions to Arrow varbinary involve an extra conversion

After this patch is merged, I would like to experiment with a "columnar" tokenization where we actually construct the Arrow varbinary layout on a column-by-column basis as we tokenize the data, so that:

* The type conversion hot path does not do strided memory access; conversion performance should be better
* If the column's type is string, no further copies are necessary

I don't know for sure, but I expect this to be a performance win. I can dig into this in the coming weeks.

[ Full content available at: https://github.com/apache/arrow/pull/2576 ]
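For context, a minimal Python sketch of the Arrow variable-length binary (varbinary) layout that the columnar tokenization would build directly: a contiguous data buffer plus an offsets buffer with one extra entry, so value `i` is `data[offsets[i]:offsets[i+1]]`. The `build_varbinary` helper below is hypothetical, for illustration only, and is not part of this patch.

```python
def build_varbinary(values):
    """Build an Arrow-style varbinary layout from a list of byte strings.

    Returns (offsets, data): `offsets` has len(values) + 1 entries, and
    `data` is one contiguous buffer holding all values back to back.
    """
    offsets = [0]
    data = bytearray()
    for v in values:
        data.extend(v)          # append value bytes to the shared buffer
        offsets.append(len(data))  # record where the next value starts
    return offsets, bytes(data)

offsets, data = build_varbinary([b"foo", b"ba", b"r"])
# offsets == [0, 3, 5, 6], data == b"foobar"
```

Appending cell bytes into this layout while tokenizing each column in turn is what would let string columns skip the extra conversion step: the tokenizer's output buffer already *is* the final Arrow layout.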
