Hey Jason, I'd suggest you look at Apache Iceberg. It is a much more mature way of handling these metadata efficiency issues, and it provides a substantial superset of the functionality of the old metadata cache files.

On Wed, Sep 23, 2020 at 4:16 PM Jason Altekruse <altekruseja...@gmail.com> wrote:

> Hello again,
>
> I took a look through the mail archives and found a little more
> information in this and a few other threads.
>
> http://mail-archives.apache.org/mod_mbox//parquet-dev/201707.mbox/%3CCAO4re1k8-bZZZWBRuLCnm1V7AoJE1fdunSuBn%2BecRuFGPgcXnA%40mail.gmail.com%3E
>
> While I do understand the benefits of federating out the reading of
> footers, so that there is no need to worry about synchronization
> between the cached metadata and any changes to the files on disk, it
> does appear there is still a use case that this design does not solve
> well: needle-in-a-haystack selective filter queries, where the data is
> sorted by the filter column. For example, in the tests I ran with
> queries against lots of Parquet files, where the vast majority are
> pruned by a bunch of small tasks, the query takes 33 seconds, versus
> just 1-2 seconds with driver-side pruning using the summary file
> (which requires a small Spark change; a sketch of this kind of pruning
> follows at the end of this thread).
>
> In our use case we are never going to replace the contents of existing
> Parquet files (with a delete and rewrite on HDFS) or append new row
> groups onto existing files. In that case I don't believe we should
> experience any correctness problems, but I wanted to confirm whether
> there is something I am missing. I am using
> readAllFootersInParallelUsingSummaryFiles, which does fall back to
> reading individual footers if they are missing from the summary file
> (also sketched below).
>
> I am also curious whether a solution to the correctness problems could
> be to include a file length and/or last modified time in the summary
> file, which could be checked against FS metadata to confirm relatively
> quickly that the files on disk are still in sync with the metadata
> summary (a sketch of such a check follows as well). Would it be
> possible to consider avoiding this deprecation if I was to work on an
> update to implement this?
>
> - Jason Altekruse
>
>
> On Tue, Sep 15, 2020 at 8:52 PM Jason Altekruse <altekruseja...@gmail.com> wrote:
>
> > Hello all,
> >
> > I have been working on optimizing reads in Spark to avoid spinning
> > up lots of short-lived tasks that just perform row group pruning in
> > selective filter queries.
> >
> > My high-level question is why metadata summary files were marked
> > deprecated in this Parquet changeset. There isn't much explanation
> > given, or a description of what should be used instead.
> > https://github.com/apache/parquet-mr/pull/429
> >
> > There are other members of the broader Parquet community who are
> > also confused by this deprecation; see this discussion in an Arrow
> > PR.
> > https://github.com/apache/arrow/pull/4166
> >
> > In the course of making my small prototype I got an extra
> > performance boost by making Spark write out metadata summary files,
> > rather than having to read all footers on the driver. This effect
> > would be even more pronounced on a completely remote storage system
> > like S3. Writing these summary files was disabled by default in
> > SPARK-15719, because of the performance impact of appending a small
> > number of new files to an existing dataset with many files (a sketch
> > of re-enabling them follows at the end of this thread).
> >
> > https://issues.apache.org/jira/browse/SPARK-15719
> >
> > This Spark JIRA does make decent points given how Spark operates
> > today, but I think a performance optimization opportunity is missed
> > because the row group pruning is deferred to a bunch of separate
> > short-lived tasks rather than done up front; currently Spark only
> > uses footers on the driver for schema merging.
> >
> > Thanks for the help!
> > Jason Altekruse
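
For concreteness, here is a minimal sketch of the driver-side row group pruning described above, written against the parquet-mr metadata classes. The long-typed filter column and the choice to keep row groups with unusable statistics are illustrative assumptions, not details from the prototype mentioned in the thread:

    import org.apache.parquet.column.statistics.LongStatistics;
    import org.apache.parquet.column.statistics.Statistics;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

    public final class RowGroupPruning {

      // Keep a row group only when its min/max statistics for the filter
      // column admit the needle value; on a dataset sorted by that column,
      // almost every row group can be discarded on the driver before any
      // tasks are launched.
      public static boolean mightContain(BlockMetaData block, String column, long needle) {
        for (ColumnChunkMetaData chunk : block.getColumns()) {
          if (chunk.getPath().toDotString().equals(column)) {
            Statistics<?> stats = chunk.getStatistics();
            if (!(stats instanceof LongStatistics) || stats.isEmpty()) {
              return true; // unusable statistics: keep the row group to stay correct
            }
            LongStatistics ls = (LongStatistics) stats;
            return ls.getMin() <= needle && needle <= ls.getMax();
          }
        }
        return true; // filter column not present: keep the row group
      }
    }

The property that matters for correctness is that pruning only ever errs on the side of keeping a row group.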
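
The summary-file read path with per-file fallback is the (now deprecated) parquet-mr API named in the thread. A minimal sketch, assuming a flat directory of .parquet part files sitting next to their _metadata summary:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.Footer;
    import org.apache.parquet.hadoop.ParquetFileReader;

    public class SummaryFooters {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dir = new Path(args[0]);
        FileSystem fs = dir.getFileSystem(conf);

        // Part files in the dataset directory; the _metadata summary is
        // resolved relative to their parent directory.
        List<FileStatus> parts = Arrays.asList(
            fs.listStatus(dir, p -> p.getName().endsWith(".parquet")));

        // Reads footers out of the summary where present, and falls back
        // to reading the individual footer of any part file that is
        // missing from the summary.
        List<Footer> footers = ParquetFileReader
            .readAllFootersInParallelUsingSummaryFiles(conf, parts, false);

        for (Footer f : footers) {
          System.out.println(f.getFile() + ": "
              + f.getParquetMetadata().getBlocks().size() + " row groups");
        }
      }
    }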
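
The file length / last modified time proposal could look roughly like the following. SummaryEntry and its fields are hypothetical; the current summary format records neither value, so this only illustrates the freshness check, not an existing API:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical per-file record, as it might be stored in the summary
    // file at write time under the proposal.
    final class SummaryEntry {
      final String path;
      final long length;
      final long modificationTime;

      SummaryEntry(String path, long length, long modificationTime) {
        this.path = path;
        this.length = length;
        this.modificationTime = modificationTime;
      }
    }

    final class SummaryValidator {
      // True when the on-disk file still matches what the summary recorded,
      // so its cached footer can be trusted; otherwise the reader should
      // fall back to reading the file's own footer.
      static boolean isFresh(FileSystem fs, SummaryEntry e) throws IOException {
        FileStatus s = fs.getFileStatus(new Path(e.path));
        return s.getLen() == e.length
            && s.getModificationTime() == e.modificationTime;
      }
    }

A single listStatus of the dataset directory would supply the FileStatus for every entry at once, keeping the check cheap even on remote stores like S3.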
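
Finally, a sketch of re-enabling summary-file writes from Spark, which SPARK-15719 turned off by default. parquet.enable.summary-metadata is parquet-mr's job-summary switch (newer parquet-mr releases supersede it with parquet.summary.metadata.level); the example data and output path are placeholders:

    import org.apache.spark.sql.SparkSession;

    public class EnableSummaries {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("write-parquet-with-summary")
            .getOrCreate();

        // Ask parquet-mr's commit phase to write the _metadata and
        // _common_metadata summary files alongside the part files.
        spark.sparkContext().hadoopConfiguration()
            .set("parquet.enable.summary-metadata", "true");

        spark.range(1000).toDF("id").write().parquet(args[0]);
      }
    }

As the JIRA notes, appending a few files to a large existing dataset forces the commit phase to rewrite the summary from all footers, which is the cost that motivated disabling this by default.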