HippoBaro opened a new pull request, #9848:
URL: https://github.com/apache/arrow-rs/pull/9848
# Which issue does this PR close?
<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases. You can
link an issue to this PR using the GitHub syntax.
-->
- Contributes to #9731
- Depend on #9847
- Depend on #9846
# Rationale for this change
<!--
Why are you proposing this change? If this is already explained clearly in
the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your
changes and offer better suggestions for fixes.
-->
Parquet list decoding currently materializes padding for null or empty
parent lists and then copies the child array to filter that padding back out.
This is expensive for nested list columns, especially sparse lists and
fixed-width children, where memory can scale with decoded levels instead of
actual emitted child values.
# What changes are included in this PR?
<!--
There is no need to duplicate the description in the issue here but it is
sometimes worth providing a summary of the individual changes in this PR.
-->
This PR makes list child readers emit compact child arrays directly by
pushing selective null padding into the leaf `RecordReader`. It also builds
definition-level validity bitmaps word-at-a-time, sizes child buffers after
levels are decoded, and adds list runtime and peak-memory benchmarks across
element types and null densities.
# Are these changes tested?
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code
If tests are not included in your PR, please explain why (for example, are
they covered by existing tests)?
-->
Extensive test coverage for the new logic through existing list reader
tests, which now exercise the production `PrimitiveArrayReader` path with
in-memory Parquet pages (see #9846), and `BooleanBufferBuilder::append_word`
has targeted unit coverage.
Benchmarks results:
```
Name Before After
Delta
ListArray/StringList/no NULLs 7.3395 ms 5.8001 ms
(-21.0%)
ListArray/StringList/half NULLs 4.1255 ms 3.4274 ms
(-16.9%)
ListArray/Int32List/90pct NULLs 1.0366 ms 975.63 us
(-5.9%)
ListArray/Fixed32List/no NULLs 4.9298 ms 3.3137 ms
(-32.8%)
ListArray/Fixed32List/half NULLs 2.8998 ms 2.7258 ms
(-6.0%)
ListArray/Fixed32List/90pct NULLs 1.0556 ms 995.82 us
(-5.7%)
ListArray/Fixed32List/99pct NULLs 508.65 us 467.16 us
(-8.2%)
Name Before After
Delta
ListArray_peak_memory/Int32List/no NULLs 836.51 KiB 574.79 KiB
(-31.3%)
ListArray_peak_memory/Int32List/half NULLs 482.01 KiB 336.29 KiB
(-30.2%)
ListArray_peak_memory/Int32List/90pct NULLs 271.95 KiB 175.32 KiB
(-35.5%)
ListArray_peak_memory/Int32List/99pct NULLs 217.96 KiB 120.04 KiB
(-44.9%)
ListArray_peak_memory/DoubleList/no NULLs 1.2399 MiB 715.31 KiB
(-43.7%)
ListArray_peak_memory/DoubleList/half NULLs 753.66 KiB 400.39 KiB
(-46.9%)
ListArray_peak_memory/DoubleList/90pct NULLs 380.62 KiB 190.89 KiB
(-49.8%)
ListArray_peak_memory/DoubleList/99pct NULLs 315.21 KiB 121.61 KiB
(-61.4%)
ListArray_peak_memory/Fixed32List/no NULLs 3.8031 MiB 1.5760 MiB
(-58.6%)
ListArray_peak_memory/Fixed32List/half NULLs 2.1710 MiB 849.94 KiB
(-61.8%)
ListArray_peak_memory/Fixed32List/90pct NULLs 1.0017 MiB 277.35 KiB
(-73.0%)
ListArray_peak_memory/Fixed32List/99pct NULLs 898.69 KiB 130.93 KiB
(-85.4%)
ListArray_peak_memory/StringList/no NULLs 3.7925 MiB 2.4715 MiB
(-34.8%)
ListArray_peak_memory/StringList/half NULLs 1.2541 MiB 772.94 KiB
(-39.8%)
ListArray_peak_memory/StringList/90pct NULLs 296.63 KiB 188.96 KiB
(-36.3%)
ListArray_peak_memory/StringList/99pct NULLs 226.75 KiB 120.37 KiB
(-46.9%)
Name Before After
Delta
ListArray_allocated_bytes/Int32List/no NULLs 10.458 MiB 6.8018 MiB
(-35.0%)
ListArray_allocated_bytes/Int32List/half NULLs 5.9797 MiB 4.0127 MiB
(-32.9%)
ListArray_allocated_bytes/Int32List/90pct NULLs 2.9985 MiB 1.8210 MiB
(-39.3%)
ListArray_allocated_bytes/Int32List/99pct NULLs 2.5579 MiB 1.3733 MiB
(-46.3%)
ListArray_allocated_bytes/DoubleList/no NULLs 16.083 MiB 8.6546 MiB
(-46.2%)
ListArray_allocated_bytes/DoubleList/half NULLs 8.9134 MiB 4.8497 MiB
(-45.6%)
ListArray_allocated_bytes/DoubleList/90pct NULLs 4.3656 MiB 2.0179 MiB
(-53.8%)
ListArray_allocated_bytes/DoubleList/99pct NULLs 3.7482 MiB 1.3903 MiB
(-62.9%)
ListArray_allocated_bytes/Fixed32List/no NULLs 49.441 MiB 19.505 MiB
(-60.5%)
ListArray_allocated_bytes/Fixed32List/half NULLs 26.846 MiB 10.459 MiB
(-61.0%)
ListArray_allocated_bytes/Fixed32List/90pct NULLs 12.483 MiB 3.1127 MiB
(-75.1%)
ListArray_allocated_bytes/Fixed32List/99pct NULLs 10.895 MiB 1.4980 MiB
(-86.3%)
ListArray_allocated_bytes/StringList/no NULLs 47.519 MiB 21.743 MiB
(-54.2%)
ListArray_allocated_bytes/StringList/half NULLs 19.097 MiB 10.478 MiB
(-45.1%)
ListArray_allocated_bytes/StringList/90pct NULLs 3.4203 MiB 2.1165 MiB
(-38.1%)
ListArray_allocated_bytes/StringList/99pct NULLs 2.6424 MiB 1.3777 MiB
(-47.9%)
```
# Are there any user-facing changes?
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.
-->
None.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]