Hi Richard,

We have been discussing this in
https://issues.apache.org/jira/browse/ARROW-1983

All that is currently missing is (AFAICT):

* A C++ function to write a vector of FileMetaData as a _metadata file
  (make sure the file path is set in the metadata objects)
* A Python binding for this

This is a relatively low-complexity patch and does not require deep
understanding of the Parquet codebase. Would someone like to submit a
pull request?

Thanks

On Thu, May 16, 2019 at 9:37 AM Richard Zamora <rzam...@nvidia.com> wrote:
>
> Note that I was asked to post here after making a similar comment on
> GitHub (https://github.com/apache/arrow/pull/4236)…
>
> I am hoping to help improve the use of pyarrow.parquet within dask
> (https://github.com/dask/dask). To this end, I put together a simple
> notebook to explore how pyarrow.parquet can be used to read/write a
> partitioned dataset without dask (see:
> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
> If you search for "Assuming that a single-file metadata solution is
> currently missing" in that notebook, you will see where I am unsure of
> the best way to write/read metadata to/from a centralized location
> using pyarrow.parquet.
>
> I believe it would be best for dask to have a way to read/write a
> single metadata file for a partitioned dataset using pyarrow (perhaps
> a '_metadata' file?). Am I correct to assume that: (1) this
> functionality is missing in pyarrow, and (2) this approach is the best
> way to process a partitioned dataset in parallel?
>
> Best,
> Rick
>
> --
> Richard J. Zamora
> NVIDIA
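
For illustration, a rough sketch of the workflow being discussed might
look like the code below. The metadata_collector keyword on
pq.write_metadata is an assumption standing in for the missing binding
described above, not something the thread confirms exists; the other
calls (write_table, read_metadata, FileMetaData.set_file_path) are
existing pyarrow APIs.

    import os
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    root = "dataset_root"
    os.makedirs(root, exist_ok=True)
    metadata_collector = []

    # Write each piece of the partitioned dataset and collect its
    # FileMetaData, with the file path set relative to the dataset root.
    for i in range(2):
        piece_path = f"part-{i}.parquet"
        pq.write_table(table, f"{root}/{piece_path}")
        md = pq.read_metadata(f"{root}/{piece_path}")
        md.set_file_path(piece_path)
        metadata_collector.append(md)

    # Hypothetical single-file-metadata binding: merge the collected
    # FileMetaData objects and write them out as <root>/_metadata.
    pq.write_metadata(table.schema, f"{root}/_metadata",
                      metadata_collector=metadata_collector)

    # The combined footer can then be read back in one call.
    combined_md = pq.read_metadata(f"{root}/_metadata")

A reader such as dask could then open <root>/_metadata once and plan
parallel reads over the row groups it describes, rather than touching
every part file up front.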