[ 
https://issues.apache.org/jira/browse/ARROW-7885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040383#comment-17040383
 ] 

Antoine Pitrou commented on ARROW-7885:
---------------------------------------

Perhaps we should update the PyArrow documentation. Nowadays you can do 
efficient zero-copy serialization using pickle protocol 5 and out-of-band 
buffers: https://docs.python.org/3/library/pickle.html#out-of-band-buffers
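To make the out-of-band mechanism concrete, here is a minimal sketch (assuming NumPy is installed; NumPy arrays implement pickle protocol 5, so their data buffers can be handed to the callback instead of being copied into the pickle stream):
{code:python}
import pickle
import numpy as np

arr = np.zeros(1_000_000)

# With protocol 5 and a buffer_callback, large contiguous buffers are
# collected out-of-band rather than serialized inline.
buffers = []
data = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The buffers must be supplied again when unpickling.
arr2 = pickle.loads(data, buffers=buffers)
{code}
The in-band pickle stream stays small because the array payload travels separately in {{buffers}}.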

Even without using out-of-band buffers, pickle should make fewer copies than it 
used to. In any case, before believing PyArrow serialization is "more 
efficient", I would suggest you run benchmarks on your own data (for example a 
Pandas dataframe).

As for Parquet, you should be able to write it in memory if you call 
{{pyarrow.parquet.write_table}} with a {{pyarrow.BufferOutputStream}}. For 
example:
{code:python}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> tab = pa.Table.from_pydict({'a': list(range(10000))})
>>> stream = pa.BufferOutputStream()
>>> pq.write_table(tab, stream)
>>> buf = stream.getvalue()
>>> buf.size
58126
{code}
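The round trip works the same way: wrapping the resulting buffer in a {{pyarrow.BufferReader}} lets you read the Parquet payload back without touching the filesystem. A short sketch (table contents are illustrative):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

tab = pa.Table.from_pydict({'a': list(range(10000))})

# Write the table to an in-memory Parquet buffer.
stream = pa.BufferOutputStream()
pq.write_table(tab, stream)
buf = stream.getvalue()

# Read it back from memory; no file is involved.
tab2 = pq.read_table(pa.BufferReader(buf))
{code}
The buffer's bytes could just as well be stored in a database blob column and read back the same way.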


> [Python] pyarrow.serialize does not support dask dataframe
> ----------------------------------------------------------
>
>                 Key: ARROW-7885
>                 URL: https://issues.apache.org/jira/browse/ARROW-7885
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>            Reporter: Benjamin
>            Priority: Minor
>
> Currently pyarrow knows how to serialize pandas dataframes but not dask 
> dataframes.
> {code}
> SerializationCallbackError: pyarrow does not know how to serialize objects of 
> type <class 'dask.dataframe.core.DataFrame'>. {code}
> Pickling the dask dataframe foregoes the benefits of using pyarrow for the 
> sub dataframes.
> Pyarrow support for serializing dask dataframes would allow storing 
> dataframes efficiently in a database instead of a file system (e.g. parquet). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)