[ https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571603#comment-17571603 ]

Clark Zinzow commented on ARROW-10739:
--------------------------------------

Hey folks, I'm the author of the Ray PR that [~pcmoritz] linked to, which 
essentially ports Arrow's buffer truncation in the IPC serialization path to 
Python as a custom pickle serializer. I'd be happy to help push on getting this 
fixed upstream for Arrow 10.0.0.

First, is there any in-progress work by [~jcrist] or others?

If not, I could take this on in the next month or so. The two implementation 
routes I've thought of while looking at the IPC serialization code (basically 
the same routes that [~jorisvandenbossche] pointed out a year ago) are:
# Refactor the IPC writer's per-type buffer truncation logic into utilities 
that can be shared by the IPC serialization path and the pickle serialization 
path (the first sketch below illustrates the truncation idea).
# Directly use the Arrow IPC format for pickle serialization, with the pickle 
reducer being a light wrapper around the IPC serialization and deserialization 
hooks (the second sketch below illustrates this).
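
To make (1) concrete, here is a minimal sketch of what the per-type buffer 
truncation boils down to in the simplest (fixed-width, primitive) case. The 
helper name truncated_value_buffer is hypothetical; the real logic also has to 
handle validity bitmaps, offset-based layouts like strings and lists, and 
nested children:

{code:python}
import pyarrow as pa

def truncated_value_buffer(arr):
    # Hypothetical helper: trim the value buffer of a fixed-width,
    # primitive array down to the window selected by slicing.
    # arr.offset is the slice's logical start into the parent buffers.
    byte_width = arr.type.bit_width // 8
    start = arr.offset * byte_width
    length = len(arr) * byte_width
    values = arr.buffers()[1]  # buffers()[0] is the validity bitmap
    return values.slice(start, length)

ar = pa.array(range(100_000))
sliced = ar.slice(10, 1)
assert truncated_value_buffer(sliced).size == 8  # one int64 value
{code}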

Does either of these routes sound appealing? (2) has the added benefits of 
consolidating the serialization schemes on the IPC format and pushing all of 
the expensive serialization code into C++ land, but it is a larger change and 
would involve otherwise-unnecessary wrapping of plain Arrow (chunked) arrays 
in record batches to match the IPC format, so maybe (1) is the better option.
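
And to make (2) concrete, here is a rough sketch of what an IPC-backed pickle 
reducer could look like. The names _reduce_array/_reconstruct_array and the 
reducer_override hookup (which needs Python 3.8+) are illustrative, not a 
proposed API; the array is wrapped in a single-column record batch so that the 
IPC writer's existing buffer truncation kicks in:

{code:python}
import io
import pickle
import pyarrow as pa

def _reconstruct_array(data):
    # Read the one-column record batch back and unwrap the array.
    reader = pa.ipc.open_stream(data)
    return reader.read_next_batch().column(0)

def _reduce_array(arr):
    # Wrap the (possibly sliced) array in a one-column record batch and
    # write it with the IPC stream writer, which truncates the buffers
    # to just the sliced region.
    batch = pa.RecordBatch.from_arrays([arr], ["f0"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    return _reconstruct_array, (sink.getvalue().to_pybytes(),)

class ArrowPickler(pickle.Pickler):
    # Route plain Arrow arrays through the IPC-backed reducer;
    # everything else falls back to the default pickle machinery.
    def reducer_override(self, obj):
        if isinstance(obj, pa.Array):
            return _reduce_array(obj)
        return NotImplemented

ar = pa.array(['foo'] * 100_000)
out = io.BytesIO()
ArrowPickler(out, protocol=5).dump(ar.slice(10, 1))
# len(out.getvalue()) should now be on the order of hundreds of bytes,
# not the full ~700 kB parent buffer.
{code}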

> [Python] Pickling a sliced array serializes all the buffers
> -----------------------------------------------------------
>
>                 Key: ARROW-10739
>                 URL: https://issues.apache.org/jira/browse/ARROW-10739
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Maarten Breddels
>            Assignee: Alessandro Molina
>            Priority: Critical
>             Fix For: 10.0.0
>
>
> If a large array is sliced and pickled, the full buffer is serialized; this 
> leads to excessive memory usage and data transfer when using multiprocessing 
> or dask.
> {code:python}
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 700004
> >>> import pickle
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
> {code}
> NumPy, for instance, only serializes the sliced data:
> {code:python}
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> import pickle
> >>> len(pickle.dumps(ar_np[10:11]))
> 165
> {code}
> I think this makes sense if you know Arrow, but it is kind of unexpected as a user.
> Is there a workaround for this? For instance, copying an Arrow array to get 
> rid of the offset and trim the buffers?
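> Something like forcing a copy with pa.concat_arrays (which, if it allocates 
> fresh buffers sized to the logical length, would drop the unused data) might 
> do the trick:
> {code:python}
> >>> import pyarrow as pa
> >>> import pickle
> >>> ar = pa.array(['foo'] * 100_000)
> >>> trimmed = pa.concat_arrays([ar.slice(10, 1)])  # copy drops the offset
> >>> len(pickle.dumps(trimmed))  # should now be small, not ~700 kB
> {code}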


