I have started looking through this. I think we will need to do some work on the design of the tokenizer hot path. For comparison, I wrote the tokenizer that pandas uses, and I probably would not use the same design again, so we have other data points to draw on. Luckily we have benchmarks and tests, so we can refactor freely to try out different approaches and analyze that part in more depth.
[ Full content available at: https://github.com/apache/arrow/pull/2576 ] This message was relayed via gitbox.apache.org for [email protected]
