[jira] [Commented] (ARROW-1854) [Python] Improve performance of serializing object dtype ndarrays

Robert Nishihara (JIRA) Sat, 25 Nov 2017 21:52:48 -0800

    [ https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265919#comment-16265919 ]


Robert Nishihara commented on ARROW-1854:
-----------------------------------------

We may run into problems when the NumPy array can't be round-tripped with 
pickle but can be round-tripped with cloudpickle. E.g.,

{code}
import numpy as np
import pickle
import cloudpickle

class Foo(object):
    pass

a = np.array([Foo()])
{code}

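Pickle will succeed at pickling {{a}}, but it won't be able to unpickle it in 
a different process, because pickle stores only a reference to the class 
{{Foo}} (its module and name), not its definition. Cloudpickle will succeed, 
since it serializes the class definition itself, but will be much slower. Our 
current approach will succeed and will be faster than cloudpickle.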
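
The failure mode can be reproduced in a single process with only the standard 
library. The following is a minimal sketch, not the code path pyarrow uses: 
the module name {{producer_script}} is hypothetical and stands in for the 
script that defines {{Foo}}, a plain list stands in for the object ndarray, 
and removing the module from {{sys.modules}} is an assumed stand-in for 
loading the payload in a different process.

```python
import pickle
import sys
import types

# Hypothetical module name standing in for the script that defines Foo.
MODULE_NAME = "producer_script"


class Foo(object):
    pass


# Register Foo under the stand-in module so pickle can resolve it by name.
Foo.__module__ = MODULE_NAME
Foo.__qualname__ = "Foo"
producer = types.ModuleType(MODULE_NAME)
producer.Foo = Foo
sys.modules[MODULE_NAME] = producer

# Pickling succeeds: the payload records the module and class names only,
# not the class definition.
payload = pickle.dumps([Foo()])
assert MODULE_NAME.encode() in payload and b"Foo" in payload

# Simulate a different process, where that module cannot be imported.
del sys.modules[MODULE_NAME]
try:
    pickle.loads(payload)
    load_error = None
except ImportError as exc:  # ModuleNotFoundError on Python 3.6+
    load_error = exc

print(type(load_error).__name__)
```

Shipping the class definition inside the payload, as cloudpickle does for 
classes defined in the main script, avoids this error at the cost of extra 
work per serialization.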

> [Python] Improve performance of serializing object dtype ndarrays
> -----------------------------------------------------------------
>
>                 Key: ARROW-1854
>                 URL: https://issues.apache.org/jira/browse/ARROW-1854
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>             Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these 
> statements to have roughly the same performance (offloading the ndarray 
> serialization to pickle):
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on 
> ARROW-1783, but it can likely be resolved independently



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
