[
https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571603#comment-17571603
]
Clark Zinzow commented on ARROW-10739:
--------------------------------------
Hey folks, I'm the author of the Ray PR that [~pcmoritz] linked to, which
essentially ports Arrow's buffer truncation in the IPC serialization path to
Python as a custom pickle serializer. I'd be happy to help push on getting this
fixed upstream for Arrow 10.0.0.
First, is there any in-progress work by [~jcrist] or others?
If not, I could take this on in the next month or so; the two implementation
routes that I've thought of when looking at the IPC serialization code (these
are basically the same routes that [~jorisvandenbossche] pointed out a year
ago) are:
# Refactor the IPC writer's per-type buffer truncation logic into utilities
that can be shared by the IPC serialization path and the pickle serialization
path.
# Use the Arrow IPC format directly for pickle serialization, where the
pickle reducer is a light wrapper around the IPC serialization and
deserialization hooks.
Does either of these routes sound appealing? (2) has the added benefit of
consolidating the serialization schemes on the IPC format and pushing all of
the expensive serialization code into C++, but it is a larger change and would
involve otherwise-unnecessary wrapping of plain Arrow (chunked) arrays in
record batches to match the IPC format, so (1) may be the better option.
> [Python] Pickling a sliced array serializes all the buffers
> -----------------------------------------------------------
>
> Key: ARROW-10739
> URL: https://issues.apache.org/jira/browse/ARROW-10739
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Maarten Breddels
> Assignee: Alessandro Molina
> Priority: Critical
> Fix For: 10.0.0
>
>
> If a large array is sliced and pickled, the full buffers are serialized;
> this leads to excessive memory usage and data transfer when using
> multiprocessing or Dask.
> {code:python}
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 700004
> >>> import pickle
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
> {code}
> NumPy, for instance, serializes only the sliced data:
> {code:python}
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> import pickle
> >>> len(pickle.dumps(ar_np[10:11]))
> 165
> {code}
> I think this makes sense if you know Arrow, but it is unexpected as a user.
> Is there a workaround for this? For instance, copying an Arrow array to get
> rid of the offset and trim the buffers?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)