yordan-pavlov opened a new pull request #384:
URL: https://github.com/apache/arrow-rs/pull/384
# Which issue does this PR close?
Closes #200.
# Rationale for this change
This PR attempts to implement a new, more efficient and more generic
`ArrowArrayReader` as a replacement for both the `PrimitiveArrayReader` and
`ComplexObjectArrayReader` that exist today. The basic idea behind the new
`ArrowArrayReader` is to copy contiguous byte slices from parquet page buffers
to arrow array buffers as directly as possible, while avoiding unnecessary
memory allocation. For primitive types such as Int32 the performance
improvements are small in most cases, but for complex types such as strings
they can be significant (up to 6 times faster). See the benchmark results
below.
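To illustrate the "copy contiguous byte slices" idea, here is a minimal sketch for `PLAIN`-encoded strings, where each value is stored as a 4-byte little-endian length followed by its bytes. The function name `decode_plain_byte_array` and the use of plain `Vec`s in place of arrow buffers are assumptions made for illustration; this is not the code in the PR.

```rust
/// Hypothetical helper (not from the PR): decodes PLAIN-encoded BYTE_ARRAY
/// values into Arrow-style string buffers, i.e. an i32 offsets buffer and a
/// contiguous values buffer. Returns None if the page is truncated or the
/// values do not fit into i32 offsets.
fn decode_plain_byte_array(page: &[u8], num_values: usize) -> Option<(Vec<i32>, Vec<u8>)> {
    let mut offsets = Vec::with_capacity(num_values + 1);
    // Allocate the values buffer once; the page length is an upper bound.
    let mut values = Vec::with_capacity(page.len());
    offsets.push(0i32);

    let mut pos = 0usize;
    for _ in 0..num_values {
        // Each value starts with a 4-byte little-endian length prefix.
        let prefix = page.get(pos..pos + 4)?;
        let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
        pos += 4;

        // Copy the value bytes as one contiguous slice, with no per-value allocation.
        values.extend_from_slice(page.get(pos..pos + len)?);
        pos += len;

        if values.len() > i32::MAX as usize {
            return None;
        }
        offsets.push(values.len() as i32);
    }
    Some((offsets, values))
}
```

Calling it on a page holding `"foo"` and `"hi"` yields one values buffer with the bytes of both strings back to back, plus the offsets that delimit them, which is the layout an arrow `StringArray` expects.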
I did try initially to use iterators end-to-end as suggested by the linked
issue, but this required a more complex and less efficient implementation,
which was ultimately slower. This is why in this PR iterators are only used to
map parquet pages to implementations of the `ValueDecoder` trait, which know
how to read / decode byte slices for batches of values.
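The division of labour described above can be sketched roughly as follows. The trait `SketchValueDecoder`, the struct `FixedWidthPlainDecoder` and the function `read_all` are hypothetical names invented for this example; the actual `ValueDecoder` trait in the PR may have a different signature.

```rust
/// Hypothetical stand-in for the PR's `ValueDecoder`; the real trait may differ.
trait SketchValueDecoder {
    /// Hands up to `num_values` values to `sink` as one contiguous byte slice,
    /// together with the number of values in it. Returns the number of values read.
    fn read_value_bytes(&mut self, num_values: usize, sink: &mut dyn FnMut(&[u8], usize)) -> usize;
}

/// Decoder for fixed-width PLAIN values: the page bytes are already contiguous.
struct FixedWidthPlainDecoder {
    page: Vec<u8>,
    value_width: usize,
    pos: usize,
}

impl SketchValueDecoder for FixedWidthPlainDecoder {
    fn read_value_bytes(&mut self, num_values: usize, sink: &mut dyn FnMut(&[u8], usize)) -> usize {
        let remaining = (self.page.len() - self.pos) / self.value_width;
        let n = num_values.min(remaining);
        if n > 0 {
            let end = self.pos + n * self.value_width;
            sink(&self.page[self.pos..end], n);
            self.pos = end;
        }
        n
    }
}

/// The iterator is only used to map pages to decoders; the values themselves
/// are read in batches as byte slices and appended to a single output buffer.
fn read_all(
    pages: impl Iterator<Item = Vec<u8>>,
    value_width: usize,
    batch_size: usize,
) -> (Vec<u8>, usize) {
    assert!(value_width > 0, "value_width must be non-zero");
    let mut out = Vec::new();
    let mut total = 0;
    let decoders = pages.map(|page| FixedWidthPlainDecoder { page, value_width, pos: 0 });
    for mut decoder in decoders {
        loop {
            let n = decoder.read_value_bytes(batch_size, &mut |bytes: &[u8], _count: usize| {
                out.extend_from_slice(bytes)
            });
            if n == 0 {
                break;
            }
            total += n;
        }
    }
    (out, total)
}
```

In this sketch the per-value work stays out of the iterator chain: the iterator yields one decoder per page, and each decoder passes whole batches to the sink as slices.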
# What changes are included in this PR?
This PR implements the new `ArrowArrayReader` and converters for strings and
primitive types, but the new reader is currently only enabled for strings. The
plan is to enable the new `ArrowArrayReader` for more types in subsequent PRs.
Also note that `ValueDecoder`s are currently implemented only for the `PLAIN`
and `RLE_DICTIONARY` encodings.
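For dictionary-encoded strings, the same byte-slice approach amounts to gathering dictionary entries by index. The sketch below assumes the indices have already been decoded from their RLE / bit-packed form, which it does not attempt to show; `gather_from_dictionary` is a hypothetical name and not part of the PR.

```rust
/// Hypothetical helper (not from the PR): builds Arrow-style string buffers by
/// copying dictionary entries, addressed by already-decoded indices, into a
/// new offsets buffer and a contiguous values buffer.
/// Returns None on an out-of-range index or inconsistent dictionary offsets.
fn gather_from_dictionary(
    dict_offsets: &[i32],
    dict_values: &[u8],
    indices: &[u32],
) -> Option<(Vec<i32>, Vec<u8>)> {
    let mut offsets = Vec::with_capacity(indices.len() + 1);
    let mut values = Vec::new();
    offsets.push(0i32);

    for &index in indices {
        let i = index as usize;
        // Entry i of the dictionary spans dict_values[start..end].
        let start = *dict_offsets.get(i)? as usize;
        let end = *dict_offsets.get(i + 1)? as usize;
        values.extend_from_slice(dict_values.get(start..end)?);

        if values.len() > i32::MAX as usize {
            return None;
        }
        offsets.push(values.len() as i32);
    }
    Some((offsets, values))
}
```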
# Are there any user-facing changes?
There are some non-breaking changes to `MutableArrayData` and
`SlicesIterator`, @jorgecarleitao let me know what you think about those.
Here are the benchmark results:
read Int32Array, plain encoded, mandatory, no NULLs - old: time: [9.0238 us 9.1121 us 9.2100 us]
read Int32Array, plain encoded, mandatory, no NULLs - new: time: [6.9506 us 7.1606 us 7.4062 us]
read Int32Array, plain encoded, optional, no NULLs - old: time: [247.66 us 252.08 us 257.12 us]
read Int32Array, plain encoded, optional, no NULLs - new: time: [40.322 us 40.736 us 41.215 us]
read Int32Array, plain encoded, optional, half NULLs - old: time: [434.25 us 438.25 us 442.92 us]
read Int32Array, plain encoded, optional, half NULLs - new: time: [326.37 us 331.68 us 337.07 us]
read Int32Array, dictionary encoded, mandatory, no NULLs - old: time: [38.876 us 39.698 us 40.805 us]
read Int32Array, dictionary encoded, mandatory, no NULLs - new: time: [150.62 us 152.38 us 154.29 us]
read Int32Array, dictionary encoded, optional, no NULLs - old: time: [265.18 us 267.54 us 270.16 us]
read Int32Array, dictionary encoded, optional, no NULLs - new: time: [167.54 us 169.15 us 170.99 us]
read Int32Array, dictionary encoded, optional, half NULLs - old: time: [442.66 us 446.42 us 450.47 us]
read Int32Array, dictionary encoded, optional, half NULLs - new: time: [418.46 us 421.81 us 425.37 us]
read StringArray, plain encoded, mandatory, no NULLs - old: time: [1.6670 ms 1.6773 ms 1.6895 ms]
read StringArray, plain encoded, mandatory, no NULLs - new: time: [264.44 us 269.63 us 275.39 us]
read StringArray, plain encoded, optional, no NULLs - old: time: [1.8602 ms 1.8753 ms 1.8913 ms]
read StringArray, plain encoded, optional, no NULLs - new: time: [363.59 us 367.03 us 370.63 us]
read StringArray, plain encoded, optional, half NULLs - old: time: [1.5216 ms 1.5346 ms 1.5486 ms]
read StringArray, plain encoded, optional, half NULLs - new: time: [514.01 us 518.68 us 524.09 us]
read StringArray, dictionary encoded, mandatory, no NULLs - old: time: [1.4903 ms 1.5129 ms 1.5358 ms]
read StringArray, dictionary encoded, mandatory, no NULLs - new: time: [218.30 us 220.54 us 223.17 us]
read StringArray, dictionary encoded, optional, no NULLs - old: time: [1.5652 ms 1.5776 ms 1.5912 ms]
read StringArray, dictionary encoded, optional, no NULLs - new: time: [249.53 us 254.14 us 258.99 us]
read StringArray, dictionary encoded, optional, half NULLs - old: time: [1.3585 ms 1.3945 ms 1.4318 ms]
read StringArray, dictionary encoded, optional, half NULLs - new: time: [496.27 us 508.28 us 522.43 us]
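To make the old / new comparison easier to read, the speedup for a benchmark can be computed as the old time divided by the new time. The sketch below does this for four of the results above, using the middle value of each bracketed triple (assumed here to be the point estimate) converted to microseconds. The helper names `speedup` and `report` are made up for this example.

```rust
/// Ratio of old to new time; values above 1.0 mean the new reader is faster,
/// values below 1.0 mean it is slower.
fn speedup(old_us: f64, new_us: f64) -> f64 {
    old_us / new_us
}

/// (benchmark, old middle value in us, new middle value in us),
/// copied from the results listed above.
const SAMPLES: [(&str, f64, f64); 4] = [
    ("Int32Array, plain encoded, optional, no NULLs", 252.08, 40.736),
    ("Int32Array, dictionary encoded, mandatory, no NULLs", 39.698, 152.38),
    ("StringArray, plain encoded, mandatory, no NULLs", 1677.3, 269.63),
    ("StringArray, dictionary encoded, mandatory, no NULLs", 1512.9, 220.54),
];

/// Formats one line per sample, e.g. "<benchmark>: <ratio>x".
fn report() -> Vec<String> {
    SAMPLES
        .iter()
        .map(|(name, old, new)| format!("{}: {:.2}x", name, speedup(*old, *new)))
        .collect()
}
```

On these numbers the string cases come out between roughly 6x and 7x faster, while the Int32Array, dictionary encoded, mandatory case has a ratio below 1.0, meaning the new reader is slower for that benchmark.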
@nevi-me @alamb @Dandandan let me know what you think.