yordan-pavlov commented on issue #200: URL: https://github.com/apache/arrow-rs/issues/200#issuecomment-842657902
UPDATE: over the weekend I implemented a slightly different idea, which appears to have unlocked a new level of performance: instead of an iterator over structs that are essentially references to contiguous buffer regions, the iterator is now just over pages. From there, value bytes are read as byte slices (&[u8]) and passed to a callback function in a converter, which simply copies each byte slice into a MutableBuffer. This minimizes memory allocation and copying, and results in a significant performance improvement for string arrays. The time for my datafusion benchmark query has also dropped further, from 100ms to 70ms (it was 125ms before all this work).

There is still an issue with the "read Int32Array, dictionary encoded, mandatory, no NULLs" benchmark, where the new version is still slower than the old one. In all other cases the new version is now faster than the previous implementation, including "read Int32Array, plain encoded, mandatory, no NULLs", where the new version used to be slower because the old implementation was already fairly efficient. Over the next few days I will be looking into a few places in the new code where I think further improvements could be made.
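To illustrate the callback idea described above, here is a minimal sketch. It is not the code from the linked commit. All names in it (`Page`, `for_each_value`, `StringConverter`, `read_pages`, `encode_page`) are hypothetical, plain `Vec`s stand in for arrow's `MutableBuffer`, and the page is assumed to hold PLAIN-encoded byte arrays (a 4-byte little-endian length followed by the value bytes), with no definition levels or NULL handling:

```rust
/// Hypothetical stand-in for a decoded data page holding PLAIN-encoded
/// byte-array values: a 4-byte little-endian length, then that many bytes.
struct Page {
    data: Vec<u8>,
}

/// Walks the values in one page and hands each to `on_value` as a borrowed
/// byte slice, so no per-value allocation happens here.
fn for_each_value<F: FnMut(&[u8])>(page: &Page, mut on_value: F) {
    let data = &page.data;
    let mut pos = 0;
    while pos + 4 <= data.len() {
        let len =
            u32::from_le_bytes([data[pos], data[pos + 1], data[pos + 2], data[pos + 3]]) as usize;
        pos += 4;
        let end = pos + len;
        if end > data.len() {
            break; // truncated page: stop rather than read out of bounds
        }
        on_value(&data[pos..end]);
        pos = end;
    }
}

/// Hypothetical converter: the callback target that copies each slice into
/// one growing values buffer and records the end offset of every value.
/// (Vec<u8> / Vec<i32> are assumptions standing in for MutableBuffer.)
struct StringConverter {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl StringConverter {
    fn new() -> Self {
        Self { offsets: vec![0], values: Vec::new() }
    }

    fn push(&mut self, bytes: &[u8]) {
        self.values.extend_from_slice(bytes);
        self.offsets.push(self.values.len() as i32);
    }
}

/// The only iterator is over pages; values are delivered through the callback.
fn read_pages<I: Iterator<Item = Page>>(pages: I) -> StringConverter {
    let mut converter = StringConverter::new();
    for page in pages {
        for_each_value(&page, |bytes| converter.push(bytes));
    }
    converter
}

/// Helper for building test pages in the assumed layout.
fn encode_page(values: &[&str]) -> Page {
    let mut data = Vec::new();
    for v in values {
        data.extend_from_slice(&(v.len() as u32).to_le_bytes());
        data.extend_from_slice(v.as_bytes());
    }
    Page { data }
}
```

In this sketch, each value is copied once, from the page buffer into the output buffer, and the only per-value work besides that copy is recording an offset.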
Here are the latest benchmark results:

read Int32Array, plain encoded, mandatory, no NULLs - old: time: [9.3360 us 9.4986 us 9.6921 us]
read Int32Array, plain encoded, mandatory, no NULLs - new: time: [6.8815 us 6.9941 us 7.1260 us]
read Int32Array, plain encoded, optional, no NULLs - old: time: [250.83 us 254.36 us 258.59 us]
read Int32Array, plain encoded, optional, no NULLs - new: time: [49.452 us 49.547 us 49.686 us]
read Int32Array, plain encoded, optional, half NULLs - old: time: [448.57 us 456.15 us 464.68 us]
read Int32Array, plain encoded, optional, half NULLs - new: time: [340.68 us 349.96 us 361.22 us]
read Int32Array, dictionary encoded, mandatory, no NULLs - old: time: [44.508 us 45.301 us 46.256 us]
read Int32Array, dictionary encoded, mandatory, no NULLs - new: time: [162.29 us 164.37 us 166.87 us]
read Int32Array, dictionary encoded, optional, no NULLs - old: time: [336.00 us 344.43 us 353.51 us]
read Int32Array, dictionary encoded, optional, no NULLs - new: time: [233.54 us 241.86 us 251.34 us]
read Int32Array, dictionary encoded, optional, half NULLs - old: time: [458.47 us 468.36 us 481.06 us]
read Int32Array, dictionary encoded, optional, half NULLs - new: time: [464.21 us 470.32 us 477.61 us]
read StringArray, plain encoded, mandatory, no NULLs - old: time: [1.5856 ms 1.5996 ms 1.6168 ms]
read StringArray, plain encoded, mandatory, no NULLs - new: time: [312.25 us 314.47 us 317.58 us]
read StringArray, plain encoded, optional, no NULLs - old: time: [1.7269 ms 1.7466 ms 1.7679 ms]
read StringArray, plain encoded, optional, no NULLs - new: time: [332.59 us 335.79 us 339.89 us]
read StringArray, plain encoded, optional, half NULLs - old: time: [1.4635 ms 1.4821 ms 1.5060 ms]
read StringArray, plain encoded, optional, half NULLs - new: time: [533.63 us 540.17 us 548.34 us]
read StringArray, dictionary encoded, mandatory, no NULLs - old: time: [1.4385 ms 1.4566 ms 1.4804 ms]
read StringArray, dictionary encoded, mandatory, no NULLs - new: time: [410.96 us 417.04 us 423.86 us]
read StringArray, dictionary encoded, optional, no NULLs - old: time: [1.5751 ms 1.5966 ms 1.6222 ms]
read StringArray, dictionary encoded, optional, no NULLs - new: time: [456.19 us 462.95 us 470.83 us]
read StringArray, dictionary encoded, optional, half NULLs - old: time: [1.3197 ms 1.3354 ms 1.3561 ms]
read StringArray, dictionary encoded, optional, half NULLs - new: time: [585.26 us 595.95 us 608.60 us]

And here are the latest changes: https://github.com/yordan-pavlov/arrow/commit/8f4dcb1b9b0fafb6df612b39231fb585163dd6fb

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
