Hello Everyone,
I am just getting my feet wet with Apache Arrow, and I am running into
either a bug or, more likely, a simple misunderstanding of the pyarrow
API. I wrote a four-column, million-row Arrow table out to shared memory
and am attempting to read it back into a pandas DataFrame. It is
advertised that this can be done in a zero-copy manner; however, when I
call the to_pandas() method on the table I read back with pyarrow, my
memory usage increases, which suggests the conversion was not actually
zero-copy. I would have expected a true zero-copy conversion to leave
memory usage essentially flat, since the DataFrame columns would
reference the memory-mapped buffers directly.
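For context, the table was written to shared memory roughly like this
(a sketch: the column names and random data below are placeholders for
my real data):

import numpy as np
import pandas as pd
import pyarrow as pa

# Build a four-column, million-row table of doubles (illustrative data).
n = 1000000
df = pd.DataFrame({c: np.random.randn(n) for c in ['a', 'b', 'c', 'd']})
table = pa.Table.from_pandas(df, preserve_index=False)

# Write the table into shared memory as a record batch stream.
sink = pa.OSFile('/dev/shm/arrow_table', 'wb')
writer = pa.RecordBatchStreamWriter(sink, table.schema)
for batch in table.to_batches():
    writer.write_batch(batch)
writer.close()
sink.close()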
And here is the read side, where I see the problem:
import pyarrow as pa
import pandas as pd
import numpy as np
import time

start = time.time()
# Memory-map the file in shared memory and read it as a record batch stream.
mm = pa.memory_map('/dev/shm/arrow_table')
b = mm.read_buffer()
reader = pa.RecordBatchStreamReader(b)
z = reader.read_all()
print("reading time: " + str(time.time() - start))

start = time.time()
# Request a strictly zero-copy conversion to a pandas DataFrame.
df = z.to_pandas(zero_copy_only=True, use_threads=True)
print("conversion time: " + str(time.time() - start))
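For what it's worth, I am gauging memory usage by watching the
process's resident set size around the conversion, along the lines of
this sketch (psutil is just one way to check; z is the table read back
above):

import os
import psutil

def rss_mb():
    # Resident set size of the current process, in MB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

before = rss_mb()
df = z.to_pandas(zero_copy_only=True, use_threads=True)
print("RSS grew by %.1f MB" % (rss_mb() - before))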
What am I doing wrong here? Or am I simply misunderstanding what
"zero-copy" means in this context? My frantic Google searching only
turned up the following possibly relevant issue, but it was unclear to
me how it was resolved:
https://github.com/apache/arrow/issues/1649
I am using pyarrow 0.10.0.
Regards,
Bipin