This aspect of writing Parquet datasets ought to have its own section
in the documentation

http://arrow.apache.org/docs/python/parquet.html

This would be a useful contribution to the project:
https://issues.apache.org/jira/browse/ARROW-3154
On Fri, Aug 31, 2018 at 5:57 PM Anton Goloborodko
<[email protected]> wrote:
>
> Oh, you are absolutely right, ParquetWriter takes a schema! Many thanks,
> it's really embarrassing that I did not notice it...
>
> On Fri, 31 Aug 2018 at 17:43, Wes McKinney <[email protected]> wrote:
>
> > hi Anton,
> >
> > Does pa.parquet.write_metadata not do what you want?
> >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L1205
> >
> > See also https://issues.apache.org/jira/browse/ARROW-1983
> >
> > - Wes
> > On Fri, Aug 31, 2018 at 5:38 PM Anton Goloborodko
> > <[email protected]> wrote:
> > >
> > > Dear Arrow developers,
> > >
> > > > Our lab is planning to use pyarrow to store some biological information in
> > > Parquet files. We also have to store some metadata alongside, e.g. which
> > > sample the data comes from, how it was obtained and processed, etc.
> > >
> > > > Parquet seems to support file-wide metadata, but I cannot find how to
> > > > write it via pyarrow. The closest thing I could find is how to write
> > > > row-group metadata (https://github.com/pandas-dev/pandas/pull/20534), but
> > > > this seems like overkill, since our metadata is the same for all row
> > > > groups in the file.
> > >
> > > Is there any way to write file-wide Parquet metadata with pyarrow?
> > >
> > > Thank you!
> > > Anton.
> >
