Hello all,

I have been working on optimizing reads in Spark to avoid spinning up lots
of short-lived tasks that do nothing but row group pruning in selective
filter queries.

My high-level question is: why were metadata summary files marked
deprecated in this Parquet changeset? The PR doesn't give much explanation
or describe what should be used instead.
https://github.com/apache/parquet-mr/pull/429

Other members of the broader Parquet community are also confused by this
deprecation; see the discussion on this Arrow PR.
https://github.com/apache/arrow/pull/4166

In the course of building my small prototype, I got an extra performance
boost by making Spark write out metadata summary files, rather than having
to read all footers on the driver. The effect would be even more pronounced
on a fully remote storage system like S3. Writing these summary files was
disabled by default in SPARK-15719, because of the performance impact of
appending a small number of new files to an existing dataset with many
files.

https://issues.apache.org/jira/browse/SPARK-15719
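
For reference, here is a minimal sketch of how I re-enabled summary files
in my prototype. It assumes the classic parquet-mr job-summary flag
"parquet.enable.summary-metadata" (the one SPARK-15719 turned off by
default); newer parquet-mr versions replace it with
"parquet.summary.metadata.level", so the exact key depends on your
Spark/parquet-mr versions:

    // Minimal sketch (Scala, spark-shell style). The config key is the
    // classic parquet-mr job-summary flag; newer parquet-mr versions use
    // "parquet.summary.metadata.level" (NONE / COMMON_ONLY / ALL) instead.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("summary-metadata-demo")
      .getOrCreate()

    // Ask the Parquet output committer to write _metadata /
    // _common_metadata alongside the part files for this job.
    spark.sparkContext.hadoopConfiguration
      .set("parquet.enable.summary-metadata", "true")

    // The output path is illustrative.
    spark.range(1000L).toDF("id").write.parquet("/tmp/with_summary")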

This Spark JIRA does make decent points given how Spark operates today,
but I think a performance optimization opportunity is being missed because
row group pruning is deferred to a bunch of separate short-lived tasks
rather than done up front; currently Spark only uses footers on the driver
for schema merging.
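
To make the idea concrete, here is a rough sketch of what driver-side
pruning could look like against the parquet-mr footer APIs. The column
name "id", its long type, and the id >= lo predicate are all illustrative;
a real implementation would go through parquet-mr's FilterApi / statistics
filters rather than hand-rolled comparisons:

    // Rough sketch: prune row groups on the driver using footer statistics.
    // Assumes a long-typed "id" column and an id >= lo predicate, both
    // illustrative.
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.format.converter.ParquetMetadataConverter
    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.parquet.hadoop.metadata.BlockMetaData
    import scala.collection.JavaConverters._

    // Keep a row group unless its max statistic proves id >= lo can
    // never match any of its rows.
    def keep(block: BlockMetaData, lo: Long): Boolean =
      block.getColumns.asScala.find(_.getPath.toDotString == "id") match {
        case None => true // column absent from this file: cannot prune safely
        case Some(col) =>
          val stats = col.getStatistics
          stats == null || !stats.hasNonNullValue ||
            stats.genericGetMax.asInstanceOf[Number].longValue >= lo
      }

    // Indexes of the row groups in one file that survive pruning. With a
    // summary file, the same loop could run over every footer in the
    // dataset without touching the individual part files.
    def survivingRowGroups(file: Path, lo: Long): Seq[Int] = {
      val footer = ParquetFileReader.readFooter(
        new Configuration(), file, ParquetMetadataConverter.NO_FILTER)
      footer.getBlocks.asScala.zipWithIndex.collect {
        case (block, i) if keep(block, lo) => i
      }.toSeq
    }

The point is that all of this runs on the driver before any tasks are
scheduled, so only the surviving row groups need tasks at all.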

Thanks for the help!
Jason Altekruse
