kirilklein opened a new pull request, #50159:
URL: https://github.com/apache/arrow/pull/50159
### Rationale
Pickling a sliced array currently serializes the array's *entire* parent
buffers rather than just the sliced range, because `Array.__reduce__` wraps
the raw `ArrayData` buffers as-is. For a one-element slice of a large array
this serializes the whole parent (megabytes for a few bytes of data), which
is a long-standing pain point for multiprocessing / Dask / Ray (issue open
since 2020). The IPC writer already truncates sliced buffers; pickling did
not.
### What this does
Adds `arrow::internal::TrimArrayDataBuffers(ArrayData, MemoryPool*)`
(`cpp/src/arrow/array/util.{h,cc}`): when the array has a non-zero `offset`
(i.e. it is a slice sharing a parent's buffers) it compacts the array to the
referenced range via `Concatenate({MakeArray(data)})`, which already handles
every nested / variable-length / dictionary type correctly; otherwise it
returns the input unchanged. `Array.__reduce__` calls it before reducing, so
a sliced array pickles only its referenced bytes. ChunkedArray / RecordBatch
/ Table inherit the fix since they reduce through their arrays.
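For reviewers who want the idea in miniature, here is a rough pure-Python sketch of "trim before reduce". It is not the C++ change: `ToyArrayData`, `trim`, and `reduce_state` are hypothetical stand-ins that model a single int64 values buffer, and plain `bytes` slicing plays the role that `Concatenate` plays in the real implementation.

```python
import pickle
import struct
from dataclasses import dataclass

ITEM_SIZE = 8  # bytes per int64 value


@dataclass(frozen=True)
class ToyArrayData:
    """Hypothetical stand-in for ArrayData: one values buffer plus a window."""
    buffer: bytes
    offset: int  # in elements
    length: int  # in elements

    def slice(self, offset: int, length: int) -> "ToyArrayData":
        # Zero-copy slice: keep the parent's buffer, only move the window.
        return ToyArrayData(self.buffer, self.offset + offset, length)

    def to_list(self) -> list:
        # Decode only the elements inside the window.
        return list(
            struct.unpack_from(f"<{self.length}q", self.buffer, self.offset * ITEM_SIZE)
        )


def trim(data: ToyArrayData) -> ToyArrayData:
    """Compact a slice to its referenced range; leave offset-0 data untouched."""
    if data.offset == 0:
        return data  # same object, no copy
    start = data.offset * ITEM_SIZE
    stop = start + data.length * ITEM_SIZE
    return ToyArrayData(data.buffer[start:stop], 0, data.length)


def reduce_state(data: ToyArrayData, trimmed: bool) -> tuple:
    """The plain tuple a __reduce__-style hook would hand to pickle."""
    if trimmed:
        data = trim(data)
    return (data.buffer, data.offset, data.length)


parent = ToyArrayData(struct.pack("<100000q", *range(100_000)), 0, 100_000)
one = parent.slice(50_000, 1)

size_before = len(pickle.dumps(reduce_state(one, trimmed=False)))
size_after = len(pickle.dumps(reduce_state(one, trimmed=True)))
print(size_before, size_after)
```

In this toy model the untrimmed state carries the full 800,000-byte parent buffer, while the trimmed state carries 8 bytes of data plus pickle overhead, and `trim(parent)` returns the very same object.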
Pickle protocol-5 out-of-band buffers keep working (this is why the prior
IPC-based attempt, #37683, was rejected): we still reduce real `Buffer`
objects, just trimmed ones. Crucially, **unsliced arrays are returned
untouched**, so their protocol-5 pickling stays zero-copy
(`test_array_pickle_protocol5` keeps passing). The guard is `offset != 0`
rather than a buffer-size comparison precisely because allocator padding
makes an unsliced array's referenced size differ from its total buffer size,
so a size-based guard would needlessly copy (and break zero-copy for)
unsliced arrays.
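To illustrate the out-of-band mechanism itself, the following stdlib-only sketch pickles a 16-byte window of a roughly 1 MiB buffer with protocol 5. It uses `pickle.PickleBuffer` over a `memoryview` slice as an analogy for reducing a trimmed `Buffer`; it does not touch pyarrow, and the names `parent`, `window`, and `out_of_band` are purely illustrative.

```python
import pickle

parent = bytearray(range(256)) * 4096    # 1,048,576-byte "parent" buffer
window = memoryview(parent)[1000:1016]   # zero-copy 16-byte view into it

out_of_band = []
payload = pickle.dumps(
    pickle.PickleBuffer(window),
    protocol=5,
    # list.append returns None, which tells pickle to keep the buffer out-of-band.
    buffer_callback=out_of_band.append,
)

# The in-band stream holds only opcodes; the data travels as a separate buffer.
restored = pickle.loads(payload, buffers=out_of_band)

# Mutating the parent after pickling is visible through the out-of-band
# buffer, which can only happen if no copy was made.
parent[1000] = 255
shared = out_of_band[0].raw()[0] == 255
print(len(payload), out_of_band[0].raw().nbytes, shared)
```

The point of the analogy: as long as the reducer hands pickle a real buffer object, a consumer that supplies `buffer_callback` still gets zero-copy transfer, and only the referenced window is exposed.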
**Known limitation:** a *zero-offset* head slice (`arr.slice(0, k)` of a
large array) is not trimmed, since it cannot be distinguished from an
unsliced array by `offset` alone without per-type buffer-size logic. This is
no worse than the status quo (such slices already pickle the full buffers);
the common case of slices with a non-zero offset is fixed. A follow-up could
trim these too via a proper per-type buffer-truncation utility.
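The trade-off between the two guards can be tabulated with a small sketch. Everything here is an illustrative assumption: a single int64 buffer, a hypothetical 64-byte allocation granularity (`PADDING`), and the helper names `offset_guard` and `size_guard`. It is not how the C++ code inspects buffers.

```python
ITEM_SIZE = 8   # int64
PADDING = 64    # assumed allocation granularity, for illustration only


def padded(nbytes: int) -> int:
    """Round a byte count up to the next multiple of PADDING."""
    return -(-nbytes // PADDING) * PADDING


def offset_guard(offset: int, length: int, buffer_size: int) -> bool:
    """The guard described above: trim only when offset != 0."""
    return offset != 0


def size_guard(offset: int, length: int, buffer_size: int) -> bool:
    """A naive alternative: trim whenever the buffer exceeds the window."""
    return buffer_size > length * ITEM_SIZE


n = 1_000_003                        # 8,000,024 referenced bytes
buffer_size = padded(n * ITEM_SIZE)  # 8,000,064 bytes after padding

cases = {
    "unsliced": (0, n),
    "slice(500, 10)": (500, 10),
    "slice(0, 10)": (0, 10),
}
decisions = {
    name: (
        offset_guard(offset, length, buffer_size),
        size_guard(offset, length, buffer_size),
    )
    for name, (offset, length) in cases.items()
}
for name, (by_offset, by_size) in decisions.items():
    print(f"{name:16} offset-guard trims: {by_offset!s:5}  size-guard trims: {by_size}")
```

Under these assumptions the size-based guard would copy the unsliced array because of the 40 bytes of padding, while the offset-based guard leaves it alone at the cost of also leaving the `slice(0, 10)` head slice untrimmed.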
### Design note for reviewers
Earlier discussion favored refactoring the IPC writer's per-type truncation
into a shared zero-copy visitor (Option 1). That refactor stalled for years
on nested / dictionary handling. This PR instead reuses `Concatenate`, which
copies only the (small) compacted slice (the same bytes pickle would
serialize anyway), for a much smaller, lower-risk change. Happy to evolve
this toward a zero-copy `SliceBuffer`-based utility if preferred.
### Benchmarks
Local source build, arrays of 2,000,000 elements, pickling a 10-element
slice (sizes in bytes):
| type | pickled slice size before (bytes) | after (bytes) | shrink |
|------|----------------------------------:|--------------:|-------:|
| int64 | 16,000,141 | 200 | 80,001× |
| string | 32,889,061 | 247 | 133,154× |
| large_string | 40,889,071 | 297 | 137,674× |
| list<int64> | 40,000,247 | 405 | 98,766× |
| struct | 38,889,227 | 419 | 92,814× |
| dictionary | 8,001,235 | 1,236 | 6,473× |
| bool | 250,140 | 121 | 2,067× |
Containers inherit the fix: a sliced ChunkedArray / RecordBatch / Table
built on a 2M-row array pickles to ~250–340 bytes (was tens of MB).
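As a quick sanity check on the table, the shrink column can be recomputed from the before/after sizes. The sketch below only copies the numbers reported above. The "overhead" figures are an inference: they are simply what remains after subtracting the raw values buffer (2,000,000 × 8 bytes for int64, 2,000,000 / 8 bytes for bool) from the "before" size.

```python
# (before, after) pickle sizes in bytes, copied from the benchmark table.
rows = {
    "int64": (16_000_141, 200),
    "string": (32_889_061, 247),
    "large_string": (40_889_071, 297),
    "list<int64>": (40_000_247, 405),
    "struct": (38_889_227, 419),
    "dictionary": (8_001_235, 1_236),
    "bool": (250_140, 121),
}

# Shrink factor, rounded to the nearest integer as in the table.
shrink = {name: round(before / after) for name, (before, after) in rows.items()}

N = 2_000_000
int64_overhead = rows["int64"][0] - N * 8   # bytes beyond the raw values buffer
bool_overhead = rows["bool"][0] - N // 8    # bool values are bit-packed

for name, factor in shrink.items():
    print(f"{name:13} {factor:>9,}x")
print(int64_overhead, bool_overhead)
```

The recomputed factors match the table, and the int64 and bool "before" sizes are consistent with pickling the full parent values buffer plus roughly 140 bytes of other data.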
### Tests
Adds regression tests in `python/pyarrow/tests/test_array.py` covering slice
pickle size + round-trip across primitive / bool / string / list types,
protocol-5 out-of-band buffers, and the unsliced-array regression case. The
existing `test_array_pickle_protocol5` (zero-copy guarantee) continues to
pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)