[ https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266362#comment-16266362 ]

ASF GitHub Bot commented on ARROW-1854:
---------------------------------------

robertnishihara commented on issue #1360: ARROW-1854: [Python] Use pickle to 
serialize numpy arrays of objects.
URL: https://github.com/apache/arrow/pull/1360#issuecomment-347077436
 
 
   I don't have any example arrays at the moment. However, it feels like the 
kind of thing that will come up.
   
   A custom serialization context makes sense to me, as would having the 
downstream application register a more performant but less general custom 
serializer/deserializer.
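   
   As an illustration only (nothing from this PR): a minimal sketch of that kind 
of registration, assuming pyarrow's SerializationContext.register_type API and a 
hypothetical downstream container TextColumn that holds an object-dtype ndarray.
{code}
import pickle

import numpy as np
import pyarrow as pa


class TextColumn(object):
    """Hypothetical downstream container for an object-dtype ndarray."""
    def __init__(self, values):
        self.values = values


def _serialize_text_column(col):
    # Offload the object-dtype payload to a single pickle call instead of
    # a generic element-by-element path.
    return pickle.dumps(col.values, protocol=pickle.HIGHEST_PROTOCOL)


def _deserialize_text_column(data):
    return TextColumn(pickle.loads(data))


context = pa.SerializationContext()
context.register_type(TextColumn, 'TextColumn',
                      custom_serializer=_serialize_text_column,
                      custom_deserializer=_deserialize_text_column)

col = TextColumn(np.array(['foo', 'bar', None] * 100000, dtype=object))
buf = pa.serialize(col, context=context).to_buffer()
restored = pa.deserialize(buf, context=context)
{code}
   The custom serializer returns a single bytes payload, so pyarrow only has to 
serialize one object instead of 300,000.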
   
   @wesm In the scenario you're working on, are these numpy arrays of objects 
only being created by the pandas custom serializers? Or are they coming from 
somewhere else? If this mostly arises from pandas, handling this in the custom 
pandas serializers might solve the problem.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [Python] Improve performance of serializing object dtype ndarrays
> -----------------------------------------------------------------
>
>                 Key: ARROW-1854
>                 URL: https://issues.apache.org/jira/browse/ARROW-1854
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these 
> statements to have roughly the same performance (offloading the ndarray 
> serialization to pickle):
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on 
> ARROW-1783, but it can likely be resolved independently.
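>
> As an illustrative sketch only (not the implementation in the linked pull 
> request), the "offload to pickle" idea amounts to dispatching on dtype so 
> that object arrays are pickled in one call while other dtypes are rebuilt 
> from their raw buffer:
> {code}
> import pickle
> import numpy as np
>
> def serialize_ndarray(arr):
>     if arr.dtype == np.dtype(object):
>         # One pickle call for the whole array, rather than one record
>         # per Python object.
>         return ('pickle', pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL))
>     # Non-object dtypes: raw bytes plus the metadata needed to rebuild.
>     return ('raw', arr.tobytes(), str(arr.dtype), arr.shape)
>
> def deserialize_ndarray(payload):
>     if payload[0] == 'pickle':
>         return pickle.loads(payload[1])
>     _, buf, dtype, shape = payload
>     return np.frombuffer(buf, dtype=dtype).reshape(shape)
>
> arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
> roundtripped = deserialize_ndarray(serialize_ndarray(arr))
> assert roundtripped[2] is None
> {code}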



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
