Hi Matt,

Thanks for this summary. It's important for the Dask user base to be
well supported in working with Parquet files, since it represents a
sizable fraction of Arrow users at the moment. I also appreciate your
effort to establish cleaner boundaries between the projects to make
ongoing maintenance less painful. We should try to limit the exposure
of internal Parquet serialization details to external consumers.

Would anyone be able to volunteer to help with these items in the next
few weeks? I'm about to head out on vacation and won't be able to look
in detail until the last week of April at the earliest. Please keep in
mind that, per my February discussion document, I intend for the
"Parquet multi-file dataset" logic to be ported to C++ so that it can
be consumed equally well from R, C GLib, Ruby, and MATLAB [1]. So, I
think it's OK to continue building certain things in pure Python in
pyarrow/parquet.py but with the awareness that some of it may have to
be rewritten in C++ in the relatively near future.

- Wes

[1]: 
https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=sharing

On Tue, Mar 26, 2019 at 7:32 PM Matthew Rocklin <mrock...@gmail.com> wrote:
>
> Hi All,
>
> A few months ago I started a rewrite of how Dask manages Parquet
> reader/writers in an effort to simplify the system.  This work is here:
> https://github.com/dask/dask/pull/4336
>
> To summarize, Dask uses parquet reader libraries like pyarrow.parquet to
> provide scalable reading of parquet datasets in parallel.  This requires
> not only information about how to encode and decode bytes, but also about
> how to select row groups, grab data from S3/GCS/..., apply filters, find
> sorted index columns, and other operations that are more commonly critical
> in a distributed setting.  Previously the relationship between the two
> libraries was somewhat messy, with this logic spread across both in a
> haphazard way.
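>
> For reference, the user-facing entry point looks roughly like the sketch
> below; the path, column list, and storage options are just placeholders:
>
>     import dask.dataframe as dd
>
>     # Dask handles partitioning, remote filesystems, and parallel task
>     # scheduling; per-file byte decoding is delegated to pyarrow.parquet.
>     df = dd.read_parquet(
>         "s3://bucket/dataset/",            # placeholder remote path
>         engine="pyarrow",                  # use the Arrow-based reader
>         columns=["id", "value"],           # placeholder column selection
>         storage_options={"anon": True},    # passed to the filesystem layer (e.g. s3fs)
>     )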
>
> This PR tries to draw pretty strict lines between the two libraries and
> establish a contract that hopefully we can stick to more easily in the
> future.  For more information about that contract, I'd like to point people
> to the GitHub issue.
>
> Things are looking pretty good so far, but there are a few missing
> features in Arrow that would be really nice to have in order to complete
> this rewrite.  In particular, two things have come up so far (though I'm
> sure that more will arise):
>
>    1. The ability to write a metadata file, given metadata collected from
>    writing each row group (see the first sketch below).
>    https://issues.apache.org/jira/browse/ARROW-1983
>    2. Getting statistics from types like unicode and datetime that may be
>    stored differently from how users interpret them (see the second
>    sketch below).  https://issues.apache.org/jira/browse/ARROW-4139
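>
> For (1), here is a minimal sketch of the workflow Dask needs.  The
> metadata_collector keyword and the write_metadata signature are
> assumptions about how such a hook could look, not an existing API:
>
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     table = pa.Table.from_pydict({"id": [1, 2, 3], "value": [1.0, 2.0, 3.0]})
>
>     collected = []  # FileMetaData gathered from each written piece
>     pq.write_to_dataset(table, root_path="dataset",
>                         metadata_collector=collected)  # assumed hook
>
>     # Combine the collected row-group metadata into a single _metadata
>     # file so readers can plan access without opening every data file.
>     pq.write_metadata(table.schema, "dataset/_metadata",
>                       metadata_collector=collected)    # assumed signature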
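>
> For (2), this is roughly the inspection Dask does today; the min/max for
> string and timestamp columns currently come back as raw physical values
> rather than the logical values users expect (the file path is a
> placeholder):
>
>     import pyarrow.parquet as pq
>
>     pf = pq.ParquetFile("dataset/part.0.parquet")  # placeholder path
>     meta = pf.metadata
>
>     # Walk the row groups and report per-column min/max statistics, which
>     # Dask uses to find sorted index columns and to apply filters.
>     for rg in range(meta.num_row_groups):
>         row_group = meta.row_group(rg)
>         for c in range(row_group.num_columns):
>             col = row_group.column(c)
>             stats = col.statistics
>             if stats is not None and stats.has_min_max:
>                 print(col.path_in_schema, stats.min, stats.max)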
>
> My hope is that if we can resolve a few issues like this then we'll be able
> to simplify the relationship between the projects on both sides, reduce
> maintenance burden, and hopefully improve the overall experience as
> well.
>
> Best,
> -matt
>
> (this came up again in
> https://github.com/apache/arrow/pull/3988#issuecomment-476696143)
