yordan-pavlov edited a comment on pull request #1082: URL: https://github.com/apache/arrow-rs/pull/1082#issuecomment-999877477
@tustvold you are probably aware of this, but just to make sure it's not missed: when I run this branch with datafusion against a parquet file, I get the error `Parquet argument error: Parquet error: unsupported encoding for byte array: PLAIN_DICTIONARY`.

Other than that, the performance benchmark results look impressive. I was able to run the benchmark, and this branch is faster than the `ArrowArrayReader`, sometimes several times faster, in almost all cases (exceptions listed below). The `ArrowArrayReader` was itself already several times faster than the old array reader implementation in many cases, which makes these results even more impressive.

A major reason why I only implemented `ArrowArrayReader` for string arrays is that I have been struggling to make it faster for dictionary-encoded primitive arrays, but it looks like this isn't going to be a problem with this new implementation. So if we can make it faster in all benchmarks, I am happy to abandon the `ArrowArrayReader` in favor of this new implementation.

Where it is still a bit slower is in these two cases:

read StringArray, plain encoded, mandatory, no NULLs - old: time: [306.10 us 342.14 us 377.28 us]
read StringArray, plain encoded, mandatory, no NULLs - new: time: [310.84 us 337.49 us 368.74 us]

read StringArray, dictionary encoded, mandatory, no NULLs - old: time: [286.61 us 320.07 us 354.74 us]
read StringArray, dictionary encoded, mandatory, no NULLs - new: time: [222.87 us 240.56 us 260.93 us]

The reason why `ArrowArrayReader` is fast in those cases, I suspect, is that when there are no nulls / def levels, the def level buffers are not read or processed at all, see here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L566 .
This also means that the code that produces the null bitmap doesn't run either, see here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L595 . The main code path is not concerned with null values at all, which is why it is so fast when there are no nulls / def levels, see here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L592 and the string converter here https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L1164 .
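
To illustrate the idea of skipping def level handling entirely, here is a minimal, self-contained sketch. It is not the actual `ArrowArrayReader` code: the `DecodedStrings` struct, the `decode_strings` function and its parameters are all hypothetical names made up for this example, and the layout (offsets, value bytes, optional validity bitmap) is only loosely modelled on an Arrow string array.

```rust
/// Hypothetical result of decoding a string column (illustration only).
#[derive(Debug, PartialEq)]
pub struct DecodedStrings {
    /// Start/end byte offsets of each slot in `data`; length is slots + 1.
    pub offsets: Vec<i32>,
    /// Concatenated UTF-8 bytes of all non-null values.
    pub data: Vec<u8>,
    /// Validity bitmap (bit i set = slot i is non-null); `None` if no nulls are possible.
    pub null_bitmap: Option<Vec<u8>>,
}

/// Hypothetical decoder: `values` holds only the non-null values, in order.
/// `def_levels` is `None` for a mandatory column with no definition levels.
pub fn decode_strings(
    values: &[&str],
    def_levels: Option<&[i16]>,
    max_def_level: i16,
) -> DecodedStrings {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();

    let levels = match def_levels {
        // Fast path: no def levels, so nothing is read or compared per slot
        // and no null bitmap is allocated or filled.
        None => {
            for v in values {
                data.extend_from_slice(v.as_bytes());
                offsets.push(data.len() as i32);
            }
            return DecodedStrings { offsets, data, null_bitmap: None };
        }
        Some(levels) => levels,
    };

    // Slow path: one def level per slot decides whether a value is present.
    let mut bitmap = vec![0u8; (levels.len() + 7) / 8];
    let mut next_value = values.iter();
    for (slot, &level) in levels.iter().enumerate() {
        if level == max_def_level {
            // If `values` runs out early, the slot is simply left null.
            if let Some(v) = next_value.next() {
                data.extend_from_slice(v.as_bytes());
                bitmap[slot / 8] |= 1u8 << (slot % 8);
            }
        }
        // Null slots repeat the previous offset (zero-length entry).
        offsets.push(data.len() as i32);
    }
    DecodedStrings { offsets, data, null_bitmap: Some(bitmap) }
}
```

In this sketch, a call such as `decode_strings(&["ab", "c"], None, 1)` takes the fast path and does no per-slot null bookkeeping, whereas passing `Some(&[1, 0, 1][..])` pays for the level comparison and bitmap update on every slot.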