[ https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266134#comment-16266134 ]

ASF GitHub Bot commented on ARROW-1854:
---------------------------------------

wesm commented on issue #1360: ARROW-1854: [Python] Use pickle to serialize 
numpy arrays of objects.
URL: https://github.com/apache/arrow/pull/1360#issuecomment-347029430
 
 
   I made some minor tweaks to send the pickle as a buffer rather than packing 
the bytes into the union:
   
   ```
   >>> arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
   
   >>> %timeit serialized = pa.serialize(arr)
   4.66 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   
   >>> %timeit pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
   4.53 ms ± 6.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   ```
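   For reference, the pickle round-trip that the patch delegates object-dtype
   arrays to can be sketched as follows (a minimal standalone example, not the
   actual patched pyarrow code path):
   
   ```python
   import pickle
   import numpy as np
   
   # Object-dtype array like the one benchmarked above.
   arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
   
   # HIGHEST_PROTOCOL uses the most efficient pickle framing available.
   payload = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
   
   # The payload is plain bytes, so it can be shipped as an Arrow buffer
   # rather than packed element-by-element into the union.
   restored = pickle.loads(payload)
   
   assert restored.dtype == object
   assert list(restored[:3]) == ['foo', 'bar', None]
   ```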
   
   This seems pretty acceptable to me. Without this patch we have on my machine
   
   ```
   >>> %timeit serialized = pa.serialize(arr)
   24.1 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
   ```
   
   The savings become more significant when there are repeated Python objects, 
I presume.
   
   How significant is the non-importable user-defined class issue? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [Python] Improve performance of serializing object dtype ndarrays
> -----------------------------------------------------------------
>
>                 Key: ARROW-1854
>                 URL: https://issues.apache.org/jira/browse/ARROW-1854
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these 
> statements to have roughly the same performance (offloading the ndarray 
> serialization to pickle)
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on 
> ARROW-1783, but it can likely be resolved independently



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
