[
https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531399#comment-17531399
]
Jim Crist commented on ARROW-10739:
-----------------------------------
We're running into this in Dask right now when attempting to integrate Pandas
`string[pyarrow]`, since pickling pyarrow string arrays ends up serializing all
the data even if the result only includes a small slice. I'm willing to hack on
this if no one else has the bandwidth, but on initial inspection it looks a bit
more complicated than I'd like to bite off as a new-ish Arrow contributor.
With some guidance on the best path forward, though, I could probably get
something working. [~jorisvandenbossche] any further thoughts on a solution
here?
> [Python] Pickling a sliced array serializes all the buffers
> -----------------------------------------------------------
>
> Key: ARROW-10739
> URL: https://issues.apache.org/jira/browse/ARROW-10739
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Maarten Breddels
> Priority: Major
> Fix For: 9.0.0
>
>
> If a large array is sliced and then pickled, it seems the full buffers are
> serialized. This leads to excessive memory usage and data transfer when using
> multiprocessing or Dask.
> {code:python}
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 700004
> >>> import pickle
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
> # For comparison, NumPy only serializes the slice:
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> import pickle
> >>> len(pickle.dumps(ar_np[10:11]))
> 165{code}
> I think this makes sense if you know Arrow internals, but it is unexpected
> for a user. Is there a workaround for this? For instance, copying an Arrow
> array to get rid of the offset and trim the buffers?