[jira] [Created] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers

Maarten Breddels (Jira) Wed, 25 Nov 2020 12:16:08 -0800

Maarten Breddels created ARROW-10739:
----------------------------------------


             Summary: [Python] Pickling a sliced array serializes all the 
buffers
                 Key: ARROW-10739
                 URL: https://issues.apache.org/jira/browse/ARROW-10739
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Maarten Breddels


If a large array is sliced, and pickled, it seems the full buffer is 
serialized, this leads to excessive memory usage and data transfer when using 
multiprocessing or dask.
{code:java}
>>> import pyarrow as pa
>>> ar = pa.array(['foo'] * 100_000)
>>> ar.nbytes
700004
>>> import pickle
>>> len(pickle.dumps(ar.slice(10, 1)))
700165

NumPy for instance
>>> import numpy as np
>>> ar_np = np.array(ar)
>>> ar_np
array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
>>> import pickle
>>> len(pickle.dumps(ar_np[10:11]))
165{code}
I think this makes sense if you know arrow, but kind of unexpected as a user.

Is there a workaround for this? For instance copy an arrow array to get rid of 
the offset, and trim the buffers?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-10739) [Python] Pickling a sliced array serializes all the buffers

Reply via email to