nevi-me opened a new pull request #381:
URL: https://github.com/apache/arrow-rs/pull/381
# Which issue does this PR close?
Closes #257.
# Rationale for this change
Parquet splits batches into row groups, which are normally determined by a
`max_row_group_size` setting.
The Arrow writer could not respect this setting because we could not slice into structs and nested arrays correctly.
The issue is that `array.slice(offset: usize, len: usize)` does not propagate the slice into child data and recalculate child offsets, so only the top-level array data is sliced.
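To illustrate why the slice must be propagated, here is a minimal, self-contained sketch (not the arrow-rs implementation) of an Arrow-style list array stored as an offsets buffer over a flat child values buffer. Slicing the list rows requires mapping through the offsets to find the correct child range; reusing the parent's slice bounds directly would expose the wrong child values.

```rust
// Illustrative sketch of Arrow's variable-size list layout: `offsets`
// has one more entry than the number of rows, and row i's values live
// in values[offsets[i]..offsets[i + 1]].
fn slice_list(offsets: &[i32], values: &[i32], start: usize, len: usize) -> Vec<Vec<i32>> {
    // The row slice [start, start + len) maps to the child range
    // [offsets[start], offsets[start + len]) — this lookup through the
    // offsets is the propagation step that a naive top-level slice skips.
    (start..start + len)
        .map(|i| values[offsets[i] as usize..offsets[i + 1] as usize].to_vec())
        .collect()
}

fn main() {
    // Three lists: [1, 2], [3], [4, 5, 6]
    let offsets = [0, 2, 3, 6];
    let values = [1, 2, 3, 4, 5, 6];
    // Slicing rows 1..3 should yield [3] and [4, 5, 6].
    println!("{:?}", slice_list(&offsets, &values, 1, 2));
}
```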
# What changes are included in this PR?
We use the `LevelInfo` struct to keep track of its array's offset and
length. This allows us to track nested arrays' offsets, and calculate the
correct list offsets and lengths.
We then use `arrow::array::slice` to perform zero-copy slices of a batch, limiting the row group size correctly.
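The splitting step can be sketched as follows; this is a hedged illustration with hypothetical names, not the exact arrow-rs internals. Given a batch's row count and `max_row_group_size`, each resulting `(offset, length)` pair corresponds to one zero-copy slice (e.g. `batch.slice(offset, length)`) written as its own row group.

```rust
// Compute the (offset, length) pairs for splitting `num_rows` rows into
// row groups of at most `max_row_group_size` rows each. The final group
// may be shorter than the maximum.
fn row_group_slices(num_rows: usize, max_row_group_size: usize) -> Vec<(usize, usize)> {
    (0..num_rows)
        .step_by(max_row_group_size)
        .map(|offset| (offset, max_row_group_size.min(num_rows - offset)))
        .collect()
}

fn main() {
    // 10 rows with a max row group size of 4 → groups of 4, 4, and 2 rows.
    println!("{:?}", row_group_slices(10, 4));
}
```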
I have changed all writer tests to use a max row group size, ensuring that
we aren't introducing bugs when slicing.
Note that this is related to #225, but I don't think it quite covers all its
use-cases.
If we have a sliced record batch per #343, we would need to account for each array's individual offset, as there is no guarantee that all child arrays of a record batch start from the same offset.
# Are there any user-facing changes?
No. All changes are crate-internal.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]