[
https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266349#comment-16266349
]
ASF GitHub Bot commented on ARROW-1854:
---------------------------------------
robertnishihara commented on issue #1360: ARROW-1854: [Python] Use pickle to
serialize numpy arrays of objects.
URL: https://github.com/apache/arrow/pull/1360#issuecomment-347075327
@wesm, you're right, the overhead comes from the fact that
`pyarrow.serialize` doesn't handle duplication well. In the case where there is
little or no duplication, `pyarrow.serialize` seems to outperform
`pickle.dumps`. For example, see the following:
```python
import numpy as np
import pickle
import pyarrow as pa
arr = np.array([str(i) for i in range(300000)], dtype=object)
%timeit pickle.dumps(arr)
41.9 ms ± 750 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pa.serialize(arr).to_buffer()
30.9 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
Given that the current code works in all cases (as far as I can tell) and is
faster in some cases, I still prefer the current code.
However, if duplication is the common case for dataframes, then your
optimization makes sense. In that case, would it be possible to move this code
into the custom serializer for pandas dataframes instead of for numpy arrays?
Or is that infeasible? If it is, I'd prefer to enable/disable this optimization
with some sort of configuration flag, since any reasonable case we fail to
handle will lead to complaints.
Longer term, maybe this will all be solved by handling duplication properly.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> [Python] Improve performance of serializing object dtype ndarrays
> -----------------------------------------------------------------
>
> Key: ARROW-1854
> URL: https://issues.apache.org/jira/browse/ARROW-1854
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Labels: pull-request-available
> Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these
> statements to have roughly the same performance (offloading the ndarray
> serialization to pickle)
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on
> ARROW-1783, but it can likely be resolved independently
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)