By the way, initial performance testing on a CSV of string/binary columns gives 
the following ballpark numbers here (on an 8-core 3.2 GHz AMD Ryzen CPU):
* single-threaded: 150 MB/s
* multi-threaded: 600 MB/s

(I didn't bother measuring with numeric columns, as the performance of our 
number parsing routines is likely to be very bad)

With only a 4x speedup on 8 cores, the main thread's chunking routine is 
clearly the bottleneck in the multi-threaded scenario. Improving this will 
require adding an option to signal that values can't have newlines in them 
(as paratext does), and perhaps SIMD-accelerating that special case 
(as... paratext does ;-)).
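
To illustrate why such an option helps, here is a minimal Python sketch (not the chunker from this PR; the function names are hypothetical). In the general case the chunker has to track quoting state byte by byte to find a safe place to cut a block. If values are known to be newline-free, it only needs the last newline in the block, which is a single backwards search and the kind of scan that could be vectorized. The sketch assumes `"` as the quote character and `\n` line endings, and ignores backslash escaping.

```python
def find_chunk_end_quoted(data: bytes, quote: int = ord('"')) -> int:
    """General case: return the offset just past the last newline that lies
    outside a quoted value, or 0 if the block holds no complete row."""
    in_quotes = False
    last_end = 0
    for i, b in enumerate(data):
        if b == quote:
            # A doubled quote ("") toggles twice, so the state stays correct.
            in_quotes = not in_quotes
        elif b == 0x0A and not in_quotes:
            last_end = i + 1
    return last_end


def find_chunk_end_no_newlines(data: bytes) -> int:
    """Special case: values are promised to contain no newlines, so the last
    newline in the block is always a row boundary."""
    return data.rfind(b"\n") + 1


block = b'a,"x\ny",b\nc,d\ne,"f\n'
# The two routines disagree here because the final newline is inside quotes.
general_end = find_chunk_end_quoted(block)
fast_end = find_chunk_end_no_newlines(block)
```

The fast path is only correct when the caller's promise holds; on the block above it would cut a row in the middle of a quoted value.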

Overall I have three directions in mind to improve performance:
* improve the main thread's chunking routine as described above
* pre-allocate parsing scratch spaces (and perhaps recycle them with a special 
MemoryPool)
* optimize number parsing routines
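
As a rough illustration of the second item, recycling scratch spaces could look like the sketch below. This is plain Python with a hypothetical `ScratchPool` class, not Arrow's `MemoryPool` API: buffers are handed out from a free list and only allocated when the list is empty. It is not thread-safe as written; a real version would need locking or per-thread pools.

```python
class ScratchPool:
    """Hypothetical pool that recycles fixed-size scratch buffers."""

    def __init__(self, buffer_size: int):
        self.buffer_size = buffer_size
        self._free = []        # buffers returned by finished parsing tasks
        self.allocations = 0   # number of buffers actually allocated

    def acquire(self) -> bytearray:
        # Reuse a returned buffer when one is available.
        if self._free:
            return self._free.pop()
        self.allocations += 1
        return bytearray(self.buffer_size)

    def release(self, buf: bytearray) -> None:
        # The buffer contents are left as-is; the next user overwrites them.
        self._free.append(buf)


pool = ScratchPool(1 << 20)
first = pool.acquire()
pool.release(first)
second = pool.acquire()  # same object as `first`, no new allocation
```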


[ Full content available at: https://github.com/apache/arrow/pull/2576 ]