[
https://issues.apache.org/jira/browse/ARROW-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Richard Shin updated ARROW-2449:
--------------------------------
Description:
It is my understanding that pyarrow falls back to serializing functions (and
other complex Python objects) using cloudpickle, which means that the contents
of those functions are also serialized using the fallback method, rather than
the efficient method described in
[https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html].
It would be good to get the benefit of fast zero-copy (de)serialization for
objects like NumPy arrays contained inside functions.
{code:java}
In [1]: import numpy as np, pyarrow as pa
In [2]: pa.__version__
Out[2]: '0.9.0'
In [3]: arr = np.random.rand(10000)
In [4]: %timeit pa.deserialize(pa.serialize(arr).to_buffer())
The slowest run took 38.29 times longer than the fastest. This could mean that
an intermediate result is being cached.
10000 loops, best of 3: 68.7 µs per loop
In [5]: def arr_f(): return arr
In [6]: %timeit pa.deserialize(pa.serialize(arr_f).to_buffer())
The slowest run took 5.89 times longer than the fastest. This could mean that
an intermediate result is being cached.
1000 loops, best of 3: 539 µs per loop
{code}
For comparison:
{code:java}
In [7]: %timeit cloudpickle.loads(cloudpickle.dumps(arr))
1000 loops, best of 3: 193 µs per loop
In [8]: %timeit cloudpickle.loads(cloudpickle.dumps(arr_f))
The slowest run took 4.02 times longer than the fastest. This could mean that
an intermediate result is being cached.
1000 loops, best of 3: 429 µs per loop
{code}
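To make the request concrete, here is a minimal sketch of the general idea (keeping large buffers out of the serialized stream so they are not copied), using only the standard library's pickle protocol 5 out-of-band buffers from Python 3.8+. This is not pyarrow's or cloudpickle's implementation, and it does not exist in pyarrow 0.9.0. The {{state}} dict is a hypothetical stand-in for a function plus the data it references, and the {{bytearray}} stands in for a NumPy array.

```python
import pickle

# Stand-in for a large array referenced by a function. Assumption: a real
# NumPy array could take its place, since recent NumPy versions also
# support protocol 5 out-of-band pickling.
payload = bytearray(b"\x01" * 80_000)

# Hypothetical container standing in for "function state plus its data".
state = {"name": "arr_f", "data": pickle.PickleBuffer(payload)}

buffers = []
# buffer_callback receives each large buffer; list.append returns None,
# which tells pickle to keep that buffer out of the stream (out-of-band).
stream = pickle.dumps(state, protocol=5, buffer_callback=buffers.append)

# The stream holds only the small metadata; the 80 kB payload is handed
# back separately when loading.
restored = pickle.loads(stream, buffers=buffers)

# The restored buffer still refers to the original memory, so no copy
# of the payload was made during the round trip.
same_memory = memoryview(restored["data"]).obj is payload
print(len(stream), len(buffers), same_memory)
```

In this sketch the pickle stream stays small regardless of the payload size, which is the behaviour the issue asks for when a function references a large array.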
cc [~pcmoritz]
was:
It is my understanding that pyarrow falls back to serializing functions (and
other complex Python objects) using cloudpickle, which means that the contents
of those functions are also serialized using the fallback method, rather than
the efficient method described in
[https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html].
It would be good to get the benefit of fast zero-copy (de)serialization for
objects like NumPy arrays contained inside functions.
{code}
In [1]: import numpy as np, pyarrow as pa
In [2]: pa.__version__
Out[2]: '0.9.0'
In [3]: arr = np.random.rand(10000)
In [4]: %timeit pa.deserialize(pa.serialize(arr).to_buffer())
The slowest run took 38.29 times longer than the fastest. This could mean that
an intermediate result is being cached.
10000 loops, best of 3: 68.7 µs per loop
In [5]: def arr_f(): return arr
In [6]: %timeit pa.deserialize(pa.serialize(arr_f).to_buffer())
The slowest run took 5.89 times longer than the fastest. This could mean that
an intermediate result is being cached.
1000 loops, best of 3: 539 µs per loop
{code}
For comparison:
{code}
In [7]: %timeit cloudpickle.loads(cloudpickle.dumps(arr))
1000 loops, best of 3: 193 µs per loop
In [8]: %timeit cloudpickle.loads(cloudpickle.dumps(arr_f))
The slowest run took 4.02 times longer than the fastest. This could mean that
an intermediate result is being cached.
1000 loops, best of 3: 429 µs per loop
{code}
> [Python] Efficiently serialize functions containing NumPy arrays
> -----------------------------------------------------------------
>
> Key: ARROW-2449
> URL: https://issues.apache.org/jira/browse/ARROW-2449
> Project: Apache Arrow
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Richard Shin
> Priority: Major