By the way, initial performance testing on a CSV of string/binary columns gives the following ballpark numbers here (on an 8-core 3.2 GHz AMD Ryzen CPU):
* single-threaded: 150 MB/s
* multi-threaded: 600 MB/s

(I didn't bother measuring with numeric columns, as the performance of our number parsing routines is likely to be very bad.)

Clearly the main thread's chunking routine is the bottleneck in the multi-threaded scenario. Improving this will require adding an option to signal that values can't have newlines in them (as paratext does), and perhaps SIMD-accelerating that special case (as... paratext does ;-)).

Overall I have three directions in mind to improve performance:
* improve the main thread's chunking routine as described above
* pre-allocate parsing scratch spaces (and perhaps recycle them with a special MemoryPool)
* optimize number parsing routines

[ Full content available at: https://github.com/apache/arrow/pull/2576 ]
This message was relayed via gitbox.apache.org for [email protected]
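
To illustrate why the "no newlines in values" option matters for chunking, here is a rough Python sketch (not the actual Arrow C++ code; both function names are hypothetical). It assumes `\n` row terminators and `"` as the only quoting mechanism, with no escape character. In the general case the chunker has to scan every byte and track quote state to know which newlines end a row; if values are known to be newline-free, finding the end of the last complete row is a single backward search, which is the kind of operation that lends itself to SIMD acceleration.

```python
def find_chunk_end_quoted(block: bytes, quote: int = ord('"')) -> int:
    """General case: return the offset just past the last complete row.

    Newlines inside quoted values do not end a row, so every byte must
    be inspected sequentially while tracking whether we are in quotes.
    A doubled quote ("") toggles the state twice, leaving it unchanged.
    """
    in_quotes = False
    end = 0
    for i, byte in enumerate(block):
        if byte == quote:
            in_quotes = not in_quotes
        elif byte == 0x0A and not in_quotes:
            end = i + 1
    return end


def find_chunk_end_no_newlines(block: bytes) -> int:
    """Special case: values are assumed to never contain newlines.

    Every newline is then a row boundary, so one backward search is
    enough. Returns 0 if the block holds no complete row.
    """
    return block.rfind(b"\n") + 1


# A block that happens to end inside a quoted value containing a newline.
block = b'a,b\n"x\ny'
general_end = find_chunk_end_quoted(block)      # stops before the quoted value
fast_end = find_chunk_end_no_newlines(block)    # would split inside the value
```

The two functions disagree on this block, which is why the fast path can only be enabled when the user explicitly signals that values contain no newlines.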
