[GitHub] [arrow-rs] yordan-pavlov edited a comment on pull request #1082: Optimized ByteArrayReader (#1040)

GitBox Wed, 22 Dec 2021 13:30:17 -0800


yordan-pavlov edited a comment on pull request #1082:
URL: https://github.com/apache/arrow-rs/pull/1082#issuecomment-999877477



   @tustvold you are probably aware of this, but just to make sure it's not 
missed: when I run this branch with datafusion against a parquet file, I get 
the following error:
   `Parquet argument error: Parquet error: unsupported encoding for byte array: PLAIN_DICTIONARY`
   
   Other than that, the performance benchmark results look impressive - I was 
able to run the benchmark and this branch is faster than the 
`ArrowArrayReader`, sometimes several times faster, in almost all cases 
(exceptions listed below). And the `ArrowArrayReader` was already several times 
faster in many cases than the old array reader implementation, making these 
performance results even more impressive.
   
   A major reason why I only implemented `ArrowArrayReader` for string arrays 
is that I have been struggling to make it faster for dictionary-encoded 
primitive arrays, but it looks like this isn't going to be a problem with this 
new implementation.
   So if we can make it faster in all benchmarks, I am happy to abandon the 
`ArrowArrayReader` in favor of this new implementation.
   
   It is still a bit slower in these two cases:
   
   read StringArray, plain encoded, mandatory, no NULLs
   - old: time: [306.10 us 342.14 us 377.28 us]
   - new: time: [310.84 us 337.49 us 368.74 us]
   
   read StringArray, dictionary encoded, mandatory, no NULLs
   - old: time: [286.61 us 320.07 us 354.74 us]
   - new: time: [222.87 us 240.56 us 260.93 us]
   
   I suspect `ArrowArrayReader` is fast in those cases because, when there are 
no nulls / def levels, the def level buffers are not read or processed at all; 
see 
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L566
   
   This also means that the code that produces the null bitmap doesn't run 
either; see 
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L595
   
   The main path in the code is not concerned with null values at all, which is 
why it's so fast when there are no nulls / def levels; see 
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L592
and the string converter at 
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_array_reader.rs#L1164
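   
   To illustrate the idea, here is a minimal sketch of that kind of fast path. 
It is not the actual `ArrowArrayReader` code: `StringColumn` and 
`read_string_column` are hypothetical names, and the layout (offsets, data, 
optional validity) is only an assumption modelled loosely on an Arrow string 
array. When no def levels are supplied, the null-handling branch is skipped 
entirely and no validity mask is built.

```rust
/// Hypothetical decoded string column (illustration only, not an arrow-rs type).
#[derive(Debug, PartialEq)]
struct StringColumn {
    /// Start/end offsets into `data`, one more entry than there are slots.
    offsets: Vec<i32>,
    /// Concatenated UTF-8 bytes of all non-null values.
    data: Vec<u8>,
    /// `None` means no def levels were present, so no mask was computed.
    validity: Option<Vec<bool>>,
}

/// Hypothetical reader: `values` holds only the non-null values, and
/// `def_levels` is `None` for a mandatory column without nulls.
fn read_string_column(
    values: &[&str],
    def_levels: Option<&[i16]>,
    max_def_level: i16,
) -> StringColumn {
    let mut offsets = vec![0i32];
    let mut data = Vec::new();

    match def_levels {
        // Fast path: no def levels, so nothing null-related is read or built.
        None => {
            for v in values {
                data.extend_from_slice(v.as_bytes());
                offsets.push(data.len() as i32);
            }
            StringColumn { offsets, data, validity: None }
        }
        // Slow path: every slot consults its def level and updates the mask.
        Some(levels) => {
            let mut validity = Vec::with_capacity(levels.len());
            let mut remaining = values.iter();
            for &level in levels {
                let is_valid = level == max_def_level;
                if is_valid {
                    let v = remaining
                        .next()
                        .expect("one value per non-null slot");
                    data.extend_from_slice(v.as_bytes());
                }
                // A null slot repeats the previous offset (zero-length entry).
                offsets.push(data.len() as i32);
                validity.push(is_valid);
            }
            StringColumn { offsets, data, validity: Some(validity) }
        }
    }
}
```

   Calling `read_string_column(&["ab", "c"], None, 1)` takes the first branch 
and never touches a def level or a validity mask, which is the behaviour I 
suspect explains the timings in the no-NULLs benchmarks.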
   


