Hi Andreas,

You want the read_schema method -- much simpler. Check out the unit tests:

https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_ipc.py#L502
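Roughly, the idea is this (an untested sketch; depending on your pyarrow version the helpers may be spelled pa.ipc.read_schema / pa.read_schema and pa.py_buffer):

import pyarrow as pa

# `schema` is the Schema you already build for your parquet files
buf = schema.serialize()  # IPC message buffer containing only the schema

# buf.to_pybytes() can be pickled and shipped to another process; wrap the
# bytes back into a buffer and read the schema out of it
restored = pa.ipc.read_schema(pa.py_buffer(buf.to_pybytes()))
assert restored.equals(schema)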

It would be nice to have this in the Sphinx docs.

- Wes

On Wed, Jul 18, 2018 at 2:41 PM, Andreas Heider <andr...@heider.io> wrote:
> Hi,
>
> I'm using Arrow together with dask to quickly write lots of parquet files. 
> Pandas has a tendency to forget column types (in my case it's a string column 
> that might be completely null in some splits), so I'm building a Schema once 
> and then manually passing that Schema into pa.Table.from_pandas and 
> pq.ParquetWriter so all resulting files consistently have the same types.
>
> However, since dask is distributed, passing that Schema around means serialising 
> it and sending it to different processes, and this turned out to be harder than 
> expected.
>
> Simple pickling fails with "No type alias for double" on unpickling.
>
> Schema does have a .serialize(), but I can't find how to deserialize it again. 
> pa.deserialize fails with "Expected to read 923444752 metadata bytes but 
> only read 11284", and it also looks like pa.deserialize is meant for general 
> Python objects.
>
> So I've settled on this for now:
>
> def serialize_schema(schema):
>     # Write an empty IPC stream; the stream header carries the schema
>     sink = pa.BufferOutputStream()
>     writer = pa.RecordBatchStreamWriter(sink, schema)
>     writer.close()
>     return sink.get_result().to_pybytes()
>
> def deserialize_schema(buf):
>     # Open the stream and read the schema back out of its header
>     buf_reader = pa.BufferReader(buf)
>     reader = pa.RecordBatchStreamReader(buf_reader)
>     return reader.schema
>
> This works, but it's a bit more involved than I'd hoped.
>
> Do you have any advice on how this is meant to work?
>
> Thanks,
> Andreas
