Hi All,

A few months ago I started a rewrite of how Dask manages Parquet
readers and writers in an effort to simplify the system.  This work is here:
https://github.com/dask/dask/pull/4336

To summarize, Dask uses Parquet reader libraries like pyarrow.parquet to
provide scalable, parallel reading of Parquet datasets.  This requires
not only information about how to encode and decode bytes, but also
about how to select row groups, grab data from S3/GCS/..., apply
filters, find sorted index columns, and so on, concerns that are
especially critical in a distributed setting.  Previously the
relationship between the two libraries was somewhat messy, with this
logic spread across both in a haphazard way.
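
For context, the user-facing call looks roughly like the sketch below
(the path and column names are made up, and exact keyword support
varies by Dask version and engine):

    import dask.dataframe as dd

    # The parquet library handles byte-level encoding/decoding, while
    # Dask handles partitioning, remote filesystems, row-group
    # filtering, and sorted-index discovery.
    df = dd.read_parquet(
        "s3://bucket/path/to/dataset",   # hypothetical remote path
        engine="pyarrow",                # or "fastparquet"
        columns=["x", "y"],
        filters=[("year", "==", 2018)],  # prune row groups via statistics
        index="timestamp",               # sorted index column, if known
    )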

This PR tries to draw pretty strict lines between the two libraries and
establish a contract that hopefully we can stick to more easily in the
future.  For more information about that contract, I'd like to point
people to the GitHub PR above.
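
To give a rough flavor of that contract, each parquet library
implements a small set of engine hooks along the lines of the sketch
below (this is a paraphrase for illustration; the actual method names
and signatures are in the PR):

    class ParquetEngine:
        """Illustrative paraphrase of the engine contract; the real
        interface lives in the PR above."""

        @staticmethod
        def read_metadata(fs, paths, columns=None, filters=None, index=None):
            """Gather the schema, statistics, and list of pieces to read."""
            raise NotImplementedError

        @staticmethod
        def read_partition(fs, piece, columns, index):
            """Read one piece (e.g. a row group) into a pandas DataFrame."""
            raise NotImplementedError

        @staticmethod
        def write_partition(df, fs, path, **kwargs):
            """Write one partition and return any metadata collected."""
            raise NotImplementedError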

Things are looking pretty good so far, but there are a few features
missing from Arrow that it would be really nice to have in order to
complete this rewrite.  In particular, two things have come up so far
(though I'm sure that more will arise); rough sketches of both are
below:

   1. The ability to write a metadata file, given metadata collected from
   writing each row group.  https://issues.apache.org/jira/browse/ARROW-1983
   2. Getting statistics from types like unicode and datetime that may be
   stored differently from how users interpret them.
   https://issues.apache.org/jira/browse/ARROW-4139
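
To make (1) concrete, the workflow we'd like looks roughly like the
sketch below.  The merge/write calls at the end are the missing piece
that ARROW-1983 asks for; they are hypothetical here, not an existing
pyarrow API:

    import pyarrow.parquet as pq

    # Each worker writes one part and hands back the metadata for its
    # row groups; one process then combines them into a single
    # dataset-level _metadata file.
    collected = []
    for i, table in enumerate(tables):      # tables: list of pyarrow.Table
        path = "dataset/part.%d.parquet" % i
        pq.write_table(table, path)
        collected.append(pq.read_metadata(path))

    # Hypothetical, the feature being requested:
    merged = collected[0]
    for md in collected[1:]:
        merged.append_row_groups(md)
    merged.write_metadata_file("dataset/_metadata")

For (2), Dask reads per-row-group statistics roughly as below in order
to prune row groups and find sorted index columns without reading any
data; the problem is that for unicode or datetime columns the min/max
can come back as raw bytes or integers rather than the logical values
users expect:

    md = pq.read_metadata("dataset/part.0.parquet")
    for rg in range(md.num_row_groups):
        col = md.row_group(rg).column(0)
        stats = col.statistics
        if stats is not None:
            # May be raw physical values for unicode/datetime columns
            print(col.path_in_schema, stats.min, stats.max, stats.null_count)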

My hope is that if we can resolve a few issues like these then we'll be
able to simplify the relationship between the projects on both sides,
reduce maintenance burden, and hopefully improve the overall experience
as well.

Best,
-matt

(this came up again in
https://github.com/apache/arrow/pull/3988#issuecomment-476696143)
