[ https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17654573#comment-17654573 ]

Apache Arrow JIRA Bot commented on ARROW-10739:
-----------------------------------------------

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked on. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked on, or if you plan to start that work soon.

> [Python] Pickling a sliced array serializes all the buffers
> -----------------------------------------------------------
>
>                 Key: ARROW-10739
>                 URL: https://issues.apache.org/jira/browse/ARROW-10739
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Maarten Breddels
>            Assignee: Alessandro Molina
>            Priority: Critical
>             Fix For: 11.0.0
>
>
> If a large array is sliced and pickled, it seems the full buffer is 
> serialized. This leads to excessive memory usage and data transfer when using 
> multiprocessing or Dask.
> {code:python}
> >>> import pickle
> >>> import pyarrow as pa
> >>> ar = pa.array(['foo'] * 100_000)
> >>> ar.nbytes
> 700004
> >>> len(pickle.dumps(ar.slice(10, 1)))
> 700165
>
> >>> # NumPy, for instance, only serializes the sliced values:
> >>> import numpy as np
> >>> ar_np = np.array(ar)
> >>> ar_np
> array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object)
> >>> len(pickle.dumps(ar_np[10:11]))
> 165
> {code}
> I think this makes sense if you know Arrow, but it is unexpected as a user.
> Is there a workaround for this? For instance, copying the Arrow array to get 
> rid of the offset and trim the buffers?
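
One possible workaround along the lines asked above (a minimal sketch, not confirmed in this thread; the Fix For field suggests pickling itself was improved for 11.0.0) is to copy the slice into a fresh array with pa.concat_arrays, which allocates new buffers holding only the selected values:

{code:python}
import pickle

import pyarrow as pa

ar = pa.array(['foo'] * 100_000)
sliced = ar.slice(10, 1)

# Copying the slice via concatenation allocates new buffers sized for
# just the selected values, dropping the reference to the parent's data.
compact = pa.concat_arrays([sliced])

print(len(pickle.dumps(sliced)))   # large: the parent's full buffers are pickled
print(len(pickle.dumps(compact)))  # small: only the copied slice is pickled
{code}

On pyarrow versions that include the fix targeted at 11.0.0, the extra copy should no longer be necessary.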



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
