Hi All,

A few months ago I started a rewrite of how Dask manages Parquet readers/writers in an effort to simplify the system. This work is here: https://github.com/dask/dask/pull/4336

To summarize, Dask uses Parquet reader libraries like pyarrow.parquet to provide scalable reading of Parquet datasets in parallel. This requires not only information about how to encode and decode bytes, but also about how to select row groups, grab data from S3/GCS/..., apply filters, find sorted index columns, and so on, concerns that are especially critical in a distributed setting. Previously the relationship between the two libraries was somewhat messy, with this logic spread across both projects in a haphazard way. This PR tries to draw fairly strict lines between the two libraries and establish a contract that we can hopefully stick to more easily in the future. For more information about that contract, I'd point people to the GitHub issue.

Things are looking pretty good so far, but there are a few missing features in Arrow that would be really nice to have in order to complete this rewrite. Two things have come up so far (though I'm sure that more will arise):

1. The ability to write a metadata file, given metadata collected from writing each row group (a short sketch of the intended usage follows at the end of this message).
   https://issues.apache.org/jira/browse/ARROW-1983

2. Getting statistics from types like unicode and datetime that may be stored differently from how users interpret them.
   https://issues.apache.org/jira/browse/ARROW-4139

My hope is that if we can resolve a few issues like this then we'll be able to simplify the relationship between the projects on both sides, reduce maintenance burden, and hopefully improve the overall experience as well.

Best,
-matt

(this came up again in https://github.com/apache/arrow/pull/3988#issuecomment-476696143)
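P.S. To make (1) and (2) a bit more concrete, here is a rough sketch of the round trip Dask would like to be able to do. This is only an illustration under assumptions: the metadata_collector= keyword on pq.write_metadata is the hypothetical addition that ARROW-1983 asks for, not something pyarrow exposes today, and the file paths and column names are made up. The statistics access at the end uses existing pyarrow API and shows the representation issue behind (2).

    import datetime
    import os

    import pyarrow as pa
    import pyarrow.parquet as pq

    os.makedirs("dataset", exist_ok=True)
    table = pa.Table.from_pydict({
        "x": ["a", "b", "c"],
        "t": [datetime.datetime(2019, 1, i) for i in (1, 2, 3)],
    })

    # Dask writes each partition as its own Parquet file.  What we would
    # like is a way to collect the per-file metadata as we go ...
    collected = []  # hypothetical: list of FileMetaData from each write
    pq.write_table(table, "dataset/part.0.parquet")
    # collected.append(<FileMetaData produced by that write>)

    # ... and then combine it into a single ``_metadata`` file at the end.
    # The metadata_collector= keyword is the requested addition (ARROW-1983):
    # pq.write_metadata(table.schema, "dataset/_metadata",
    #                   metadata_collector=collected)

    # On the read side Dask inspects per-row-group statistics to find
    # sorted/index columns.  Currently the min/max for unicode and datetime
    # columns come back in their physical form (raw bytes / integers),
    # which is the subject of ARROW-4139:
    md = pq.ParquetFile("dataset/part.0.parquet").metadata
    stats = md.row_group(0).column(0).statistics  # column 0 is "x" (unicode)
    print(stats.min, stats.max)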
To summarize, Dask uses parquet reader libraries like pyarrow.parquet to provide scalable reading of parquet datasets in parallel. This requires both information about how to encode and decode bytes, but also on how to select row groups, grab data from S3/GCS/..., apply filters, find sorted index columns, and so on that are more commonly critical in a distributed setting. Previously the relationship between the two libraries was somewhat messy, where this logic was spread across in a haphazard way. This PR tries to draw pretty strict lines between the two libraries and establish a contract that hopefully we can stick to more easily in the future. For more information about that contract, I'd like to point people to the github issue. Things are looking pretty good so far, but there have been a few missing features in Arrow that would be really nice to be able to complete this rewrite. In particular two things have come up so far (though I'm sure that more will arise) 1. The ability to write a metadata file, given metadata collected from writing each row group. https://issues.apache.org/jira/browse/ARROW-1983 2. Getting statistics from types like unicode and datetime that may be stored differently from how users interpret them. https://issues.apache.org/jira/browse/ARROW-4139 My hope is that if we can resolve a few issues like this then we'll be able to simlify the relationship between the projects on both sides, reduce maintenance burden, and hopefully add improve the overall experience as well. Best, -matt (this came up again in https://github.com/apache/arrow/pull/3988#issuecomment-476696143)