yordan-pavlov opened a new pull request #384:
URL: https://github.com/apache/arrow-rs/pull/384


   # Which issue does this PR close?
   
   Closes #200.
   
   # Rationale for this change
   This PR attempts to implement a new, more efficient, and more generic `ArrowArrayReader` as a replacement for both the `PrimitiveArrayReader` and `ComplexObjectArrayReader` that exist today. The basic idea behind the new `ArrowArrayReader` is to copy contiguous byte slices from parquet page buffers to arrow array buffers as directly as possible, while avoiding unnecessary memory allocation. For primitive types such as Int32, the performance improvements are small in most cases, but for complex types such as strings they can be significant (up to 6 times faster). See benchmark results below.
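
To illustrate the "copy contiguous byte slices" idea for strings, here is a minimal sketch, not the code in this PR. It assumes a page body of `PLAIN`-encoded `BYTE_ARRAY` values (each value is a 4-byte little-endian length followed by its bytes) and builds the two buffers a string array needs: `i32` offsets and one contiguous values buffer. The names `decode_plain_byte_arrays` and `plain_page` are hypothetical helpers made up for this example.

```rust
/// Hypothetical helper: decodes `num_values` PLAIN-encoded BYTE_ARRAY values
/// into an offsets buffer and a single contiguous values buffer.
/// Returns `None` if the page is shorter than the encoded lengths claim.
fn decode_plain_byte_arrays(page: &[u8], num_values: usize) -> Option<(Vec<i32>, Vec<u8>)> {
    let mut offsets = Vec::with_capacity(num_values + 1);
    let mut values = Vec::with_capacity(page.len());
    offsets.push(0i32);
    let mut pos = 0usize;
    for _ in 0..num_values {
        // 4-byte little-endian length prefix.
        let prefix = page.get(pos..pos + 4)?;
        let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
        pos += 4;
        // Copy the value bytes straight into the shared values buffer,
        // without allocating a separate object per value.
        values.extend_from_slice(page.get(pos..pos + len)?);
        pos += len;
        // Assumes the total size fits in i32, as for a regular string array.
        offsets.push(values.len() as i32);
    }
    Some((offsets, values))
}

/// Hypothetical helper that builds a PLAIN-encoded page body for the example.
fn plain_page(strings: &[&str]) -> Vec<u8> {
    let mut page = Vec::new();
    for s in strings {
        page.extend_from_slice(&(s.len() as u32).to_le_bytes());
        page.extend_from_slice(s.as_bytes());
    }
    page
}

fn main() {
    let page = plain_page(&["ab", "", "cde"]);
    let (offsets, values) = decode_plain_byte_arrays(&page, 3).unwrap();
    assert_eq!(offsets, vec![0, 2, 2, 5]);
    assert_eq!(values, b"abcde".to_vec());
}
```

The point of the sketch is that each value costs one `extend_from_slice` into a shared buffer rather than one allocation per string, which is the kind of saving the string benchmarks are meant to measure.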
   
   I did try initially to use iterators end-to-end as suggested by the linked issue, but this required a more complex and less efficient implementation which was ultimately slower. This is why in this PR iterators are only used to map parquet pages to implementations of the `ValueDecoder` trait, which know how to read / decode byte slices for batches of values.
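
A rough sketch of that shape, under stated assumptions: the trait below is a simplified, hypothetical stand-in and not the `ValueDecoder` signature from this PR. `PlainFixedDecoder`, `DictDecoder`, and `read_batch` are likewise invented for illustration, and the dictionary keys are assumed to be already unpacked (real `RLE_DICTIONARY` pages store them in an RLE / bit-packed hybrid).

```rust
/// Hypothetical, simplified decoder trait: hands contiguous byte slices for a
/// batch of values to `sink`, together with the number of values in the slice.
trait ValueDecoder {
    /// Reads up to `num_values` values and returns how many were read.
    fn read_value_bytes(&mut self, num_values: usize, sink: &mut dyn FnMut(&[u8], usize)) -> usize;
}

/// PLAIN fixed-width values sit back to back, so a whole batch is one slice.
/// Assumes `width > 0`.
struct PlainFixedDecoder { data: Vec<u8>, pos: usize, width: usize }

impl ValueDecoder for PlainFixedDecoder {
    fn read_value_bytes(&mut self, num_values: usize, sink: &mut dyn FnMut(&[u8], usize)) -> usize {
        let available = (self.data.len() - self.pos) / self.width;
        let n = num_values.min(available);
        let end = self.pos + n * self.width;
        if n > 0 {
            sink(&self.data[self.pos..end], n);
        }
        self.pos = end;
        n
    }
}

/// Dictionary-encoded fixed-width values: each key selects a slice of the
/// dictionary buffer. Keys are assumed to be already unpacked.
struct DictDecoder { dict: Vec<u8>, width: usize, keys: Vec<u32>, pos: usize }

impl ValueDecoder for DictDecoder {
    fn read_value_bytes(&mut self, num_values: usize, sink: &mut dyn FnMut(&[u8], usize)) -> usize {
        let n = num_values.min(self.keys.len() - self.pos);
        for &key in &self.keys[self.pos..self.pos + n] {
            let start = key as usize * self.width;
            sink(&self.dict[start..start + self.width], 1);
        }
        self.pos += n;
        n
    }
}

/// The iterator yields one decoder per page; the batch loop stays a plain loop
/// that appends byte slices to the output buffer.
fn read_batch(decoders: impl Iterator<Item = Box<dyn ValueDecoder>>, batch_size: usize) -> Vec<u8> {
    let mut out = Vec::new();
    let mut remaining = batch_size;
    for mut decoder in decoders {
        if remaining == 0 {
            break;
        }
        let read = decoder.read_value_bytes(remaining, &mut |bytes: &[u8], _count: usize| {
            out.extend_from_slice(bytes)
        });
        remaining -= read;
    }
    out
}

fn main() {
    let plain = PlainFixedDecoder { data: vec![1, 0, 0, 0, 2, 0, 0, 0], pos: 0, width: 4 };
    let dict = DictDecoder { dict: vec![7, 0, 0, 0, 9, 0, 0, 0], width: 4, keys: vec![1, 0, 1], pos: 0 };
    let pages: Vec<Box<dyn ValueDecoder>> = vec![Box::new(plain), Box::new(dict)];
    let bytes = read_batch(pages.into_iter(), 4);
    let values: Vec<i32> = bytes
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    assert_eq!(values, vec![1, 2, 9, 7]);
}
```

In this sketch the iterator is only responsible for choosing the right decoder per page, while the per-value work happens in tight loops over byte slices.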
   
   # What changes are included in this PR?
   This PR implements the new `ArrowArrayReader` and converters for strings and primitive types, but the new reader is currently enabled only for strings. The plan is to enable it for more types in subsequent PRs. Also note that `ValueDecoder`s are currently implemented only for the `PLAIN` and `RLE_DICTIONARY` encodings.
   
   
   # Are there any user-facing changes?
   There are some non-breaking changes to `MutableArrayData` and `SlicesIterator`; @jorgecarleitao, let me know what you think about those.
   
   Here are the benchmark results:
   read Int32Array, plain encoded, mandatory, no NULLs - old: time:   [9.0238 us 9.1121 us 9.2100 us]
   read Int32Array, plain encoded, mandatory, no NULLs - new: time:   [6.9506 us 7.1606 us 7.4062 us]

   read Int32Array, plain encoded, optional, no NULLs - old: time:   [247.66 us 252.08 us 257.12 us]
   read Int32Array, plain encoded, optional, no NULLs - new: time:   [40.322 us 40.736 us 41.215 us]

   read Int32Array, plain encoded, optional, half NULLs - old: time:   [434.25 us 438.25 us 442.92 us]
   read Int32Array, plain encoded, optional, half NULLs - new: time:   [326.37 us 331.68 us 337.07 us]

   read Int32Array, dictionary encoded, mandatory, no NULLs - old: time:   [38.876 us 39.698 us 40.805 us]
   read Int32Array, dictionary encoded, mandatory, no NULLs - new: time:   [150.62 us 152.38 us 154.29 us]

   read Int32Array, dictionary encoded, optional, no NULLs - old: time:   [265.18 us 267.54 us 270.16 us]
   read Int32Array, dictionary encoded, optional, no NULLs - new: time:   [167.54 us 169.15 us 170.99 us]

   read Int32Array, dictionary encoded, optional, half NULLs - old: time:   [442.66 us 446.42 us 450.47 us]
   read Int32Array, dictionary encoded, optional, half NULLs - new: time:   [418.46 us 421.81 us 425.37 us]

   read StringArray, plain encoded, mandatory, no NULLs - old: time:   [1.6670 ms 1.6773 ms 1.6895 ms]
   read StringArray, plain encoded, mandatory, no NULLs - new: time:   [264.44 us 269.63 us 275.39 us]

   read StringArray, plain encoded, optional, no NULLs - old: time:   [1.8602 ms 1.8753 ms 1.8913 ms]
   read StringArray, plain encoded, optional, no NULLs - new: time:   [363.59 us 367.03 us 370.63 us]

   read StringArray, plain encoded, optional, half NULLs - old: time:   [1.5216 ms 1.5346 ms 1.5486 ms]
   read StringArray, plain encoded, optional, half NULLs - new: time:   [514.01 us 518.68 us 524.09 us]

   read StringArray, dictionary encoded, mandatory, no NULLs - old: time:   [1.4903 ms 1.5129 ms 1.5358 ms]
   read StringArray, dictionary encoded, mandatory, no NULLs - new: time:   [218.30 us 220.54 us 223.17 us]

   read StringArray, dictionary encoded, optional, no NULLs - old: time:   [1.5652 ms 1.5776 ms 1.5912 ms]
   read StringArray, dictionary encoded, optional, no NULLs - new: time:   [249.53 us 254.14 us 258.99 us]

   read StringArray, dictionary encoded, optional, half NULLs - old: time:   [1.3585 ms 1.3945 ms 1.4318 ms]
   read StringArray, dictionary encoded, optional, half NULLs - new: time:   [496.27 us 508.28 us 522.43 us]
   
   @nevi-me @alamb @Dandandan let me know what you think.

