kirilklein opened a new pull request, #50159:
URL: https://github.com/apache/arrow/pull/50159

   ### Rationale
   
   Pickling a sliced array currently serializes the array's *entire* parent
   buffers rather than just the sliced range, because `Array.__reduce__` wraps 
the
   raw `ArrayData` buffers as-is. For a one-element slice of a large array this
   serializes the whole parent (megabytes for a few bytes of data), which is a
   long-standing pain point for multiprocessing / Dask / Ray (issue open since
   2020). The IPC writer already truncates sliced buffers; pickling did not.
   
   ### What this does
   
   Adds `arrow::internal::TrimArrayDataBuffers(ArrayData, MemoryPool*)`
   (`cpp/src/arrow/array/util.{h,cc}`): when the array has a non-zero `offset`
   (i.e. it is a slice sharing a parent's buffers) it compacts the array to the
   referenced range via `Concatenate({MakeArray(data)})`, which already handles
   every nested / variable-length / dictionary type correctly; otherwise it
   returns the input unchanged. `Array.__reduce__` calls it before reducing, so 
a
   sliced array pickles only its referenced bytes. ChunkedArray / RecordBatch /
   Table inherit the fix since they reduce through their arrays.
   
   Pickle protocol-5 out-of-band buffers keep working (this is why the prior
   IPC-based attempt, #37683, was rejected): we still reduce real `Buffer`
   objects, just trimmed ones. Crucially, **unsliced arrays are returned
   untouched**, so their protocol-5 pickling stays zero-copy
   (`test_array_pickle_protocol5` keeps passing). The guard is `offset != 0`
   rather than a buffer-size comparison precisely because allocator padding 
makes
   an unsliced array's referenced size differ from its total buffer size — a
   size-based guard would needlessly copy (and break zero-copy for) unsliced
   arrays.
   
   **Known limitation:** a *zero-offset* head slice (`arr.slice(0, k)` of a 
large
   array) is not trimmed, since it cannot be distinguished from an unsliced 
array
   by `offset` alone without per-type buffer-size logic. This is no worse than 
the
   status quo (such slices already pickle the full buffers); the common case of
   slices with a non-zero offset is fixed. A follow-up could trim these too via 
a
   proper per-type buffer-truncation utility.
   
   ### Design note for reviewers
   
   Earlier discussion favored refactoring the IPC writer's per-type truncation
   into a shared zero-copy visitor (Option 1). That refactor stalled for years 
on
   nested / dictionary handling. This PR instead reuses `Concatenate`, which
   copies only the (small) compacted slice — the same bytes pickle would 
serialize
   anyway — for a much smaller, lower-risk change. Happy to evolve this toward a
   zero-copy `SliceBuffer`-based utility if preferred.
   
   ### Benchmarks
   
   Local source build, arrays of 2,000,000 elements, pickling a 10-element 
slice:
   
   | type | slice pickle before | after | shrink |
   |------|--------------------:|------:|-------:|
   | int64 | 16,000,141 | 200 | 80,001× |
   | string | 32,889,061 | 247 | 133,154× |
   | large_string | 40,889,071 | 297 | 137,674× |
   | list<int64> | 40,000,247 | 405 | 98,766× |
   | struct | 38,889,227 | 419 | 92,814× |
   | dictionary | 8,001,235 | 1,236 | 6,473× |
   | bool | 250,140 | 121 | 2,067× |
   
   Containers inherit the fix: a sliced ChunkedArray / RecordBatch / Table 
built on
   a 2M-row array pickles to ~250–340 bytes (was tens of MB).
   
   ### Tests
   
   Adds regression tests in `python/pyarrow/tests/test_array.py` covering slice
   pickle size + round-trip across primitive / bool / string / list types,
   protocol-5 out-of-band buffers, and the unsliced-array regression case. The
   existing `test_array_pickle_protocol5` (zero-copy guarantee) continues to 
pass.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to