Hi Richard,

We have been discussing this in

https://issues.apache.org/jira/browse/ARROW-1983

All that is currently missing is (AFAICT):

* A C++ function to write a vector of FileMetaData as a _metadata file
(make sure the file path is set in the metadata objects)
* A Python binding for this

This is a relatively low-complexity patch and does not require deep
understanding of the Parquet codebase. Would someone like to submit a
pull request?

Thanks

On Thu, May 16, 2019 at 9:37 AM Richard Zamora <rzam...@nvidia.com> wrote:
>
> Note that I was asked to post here after making a similar comment on GitHub 
> (https://github.com/apache/arrow/pull/4236)…
>
> I am hoping to help improve the use of pyarrow.parquet within dask 
> (https://github.com/dask/dask). To this end, I put together a simple notebook 
> to explore how pyarrow.parquet can be used to read/write a partitioned 
> dataset without dask (see: 
> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
>   If you search for "Assuming that a single-file metadata solution is 
> currently missing" in that notebook, you will see where I am unsure of the 
> best way to write/read metadata to/from a centralized location using 
> pyarrow.parquet.
>
> I believe that it would be best for dask to have a way to read/write a single 
> metadata file for a partitioned dataset using pyarrow (perhaps a ‘_metadata’ 
> file?). Am I correct to assume that: (1) this functionality is missing in 
> pyarrow, and (2) this approach is the best way to process a partitioned 
> dataset in parallel?
>
> Best,
> Rick
>
> --
> Richard J. Zamora
> NVIDIA
>
>
>
