[
https://issues.apache.org/jira/browse/ARROW-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Antoine Pitrou updated ARROW-2449:
----------------------------------
Component/s: Python
> [Python] Efficiently serialize functions containing NumPy arrays
> -----------------------------------------------------------------
>
> Key: ARROW-2449
> URL: https://issues.apache.org/jira/browse/ARROW-2449
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.9.0
> Reporter: Richard Shin
> Priority: Major
>
> It is my understanding that pyarrow falls back to serializing functions (and
> other complex Python objects) using cloudpickle, which means that the
> contents of those functions are also serialized using the fallback method,
> rather than the efficient method described in
> [https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.]
> It would be good to get the benefit of fast zero-copy (de)serialization for
> objects like NumPy arrays contained inside functions.
> {code:java}
> In [1]: import numpy as np, pyarrow as pa
> In [2]: pa.__version__
> Out[2]: '0.9.0'
> In [3]: arr = np.random.rand(10000)
> In [4]: %timeit pa.deserialize(pa.serialize(arr).to_buffer())
> The slowest run took 38.29 times longer than the fastest. This could mean
> that an intermediate result is being cached.
> 10000 loops, best of 3: 68.7 µs per loop
> In [5]: def arr_f(): return arr
> In [6]: %timeit pa.deserialize(pa.serialize(arr_f).to_buffer())
> The slowest run took 5.89 times longer than the fastest. This could mean that
> an intermediate result is being cached.
> 1000 loops, best of 3: 539 µs per loop
> {code}
> For comparison:
> {code:java}
> In [7]: %timeit cloudpickle.loads(cloudpickle.dumps(arr))
> 1000 loops, best of 3: 193 µs per loop
> In [8]: %timeit cloudpickle.loads(cloudpickle.dumps(arr_f))
> The slowest run took 4.02 times longer than the fastest. This could mean that
> an intermediate result is being cached.
> 1000 loops, best of 3: 429 µs per loop
> {code}
> cc [~pcmoritz]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)