Hey Ryan,

Thanks for the response. I do not want to push the Parquet community to
keep around features that will cause users headaches, but I am still
looking for the best solution to the problem I am facing.

One thing I could use some clarity on: given what I have seen in various
tools, I am actually not sure that there is a significant risk of wrong
results, provided one possible small recommendation about how these files
are used is followed.

I was not assuming that directories were immutable, and the method included
in parquet-mr that I referenced, readAllFootersInParallelUsingSummaryFiles,
also goes against the notion that this is a hard requirement. It
specifically takes in a file listing, which needs to be provided by some
external source, either the FS directly or some transactional system like
Delta in my case, and it uses the metadata summary file just as a cache for
footers, with explicit code to fall back to reading any footers missing
from the summary directly from the FS itself.
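
To make that concrete, here is roughly how we use it (a minimal sketch;
the table path and suffix filter are placeholders, and in our deployment
the listing would come from the Delta transaction log rather than from a
plain listStatus):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.Footer;
    import org.apache.parquet.hadoop.ParquetFileReader;

    public class SummaryAsFooterCache {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The caller supplies the file listing from an external source of
        // truth (a plain listStatus here for brevity; for us, the Delta log).
        List<FileStatus> dataFiles = Arrays.asList(
            fs.listStatus(new Path("/tables/events"),
                p -> p.getName().endsWith(".parquet")));

        // The _metadata summary is consulted only as a cache of footers; any
        // footer missing from it is read directly from the data file itself.
        List<Footer> footers = ParquetFileReader
            .readAllFootersInParallelUsingSummaryFiles(conf, dataFiles, false);

        for (Footer footer : footers) {
          System.out.println(footer.getFile() + ": "
              + footer.getParquetMetadata().getBlocks().size() + " row groups");
        }
      }
    }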

It sounds like some projects in the past have used this file to avoid doing
an FS listing, which I absolutely agree isn't safe to do and will cause
problems when people copy new files into a directory. Can we just document
that this practice is bad? And possibly deprecate any code that reads the
summary file without this kind of fallback, with the expectation that
callers pass in the list of files they expect to get footers for?
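
To sketch the contract I have in mind (a hypothetical signature, not an
existing parquet-mr API; it just delegates to the current method, the point
being that the caller's listing, not the summary file, defines the
dataset):

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.parquet.hadoop.Footer;
    import org.apache.parquet.hadoop.ParquetFileReader;

    public class SafeSummaryReads {
      // Hypothetical non-deprecated entry point: the summary file can only
      // ever act as a cache, because the caller states which files make up
      // the dataset; footers missing from the summary are read from the
      // files themselves.
      static List<Footer> readFootersUsingSummaryAsCache(
          Configuration conf, List<FileStatus> expectedFiles) throws IOException {
        return ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(
            conf, expectedFiles, false);
      }
    }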

I also don't know if I have ever seen a project take advantage of the fact
that you can technically append directly to a parquet file by reading in
the previous footer, appending new row groups, and writing out a whole new
footer whose metadata combines the old and the new, leaving dead bytes in
the file where the old footer sat. I do remember discussing this
possibility with Julien at some point, but I don't know if parquet-mr or
any other project has actually written code to do this. If this were done,
it would provide another way for the summary file to become stale, and it
would not be detectable from the filename alone; the summary would need to
contain file length info.
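
To make the check concrete: an in-place append necessarily grows the file,
so a summary that recorded each file's length could be validated against FS
metadata quickly. A rough sketch, assuming a hypothetical per-file
lengthInSummary value that a future summary format would record (not
something the summary stores today):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SummaryStalenessCheck {

      // An in-place append writes new row groups plus a new combined footer
      // at the end, so the file must have grown; comparing lengths catches
      // this even though the filename is unchanged.
      static boolean isStale(FileSystem fs, Path file, long lengthInSummary)
          throws IOException {
        return fs.getFileStatus(file).getLen() != lengthInSummary;
      }

      // For reference, the footer is located from the file tail: the last 8
      // bytes are a 4-byte little-endian footer length followed by the magic
      // "PAR1". After an in-place append, the old footer's bytes remain at a
      // smaller offset, but nothing points to them anymore (the dead bytes).
      static long footerStartOffset(FileSystem fs, Path file) throws IOException {
        FileStatus status = fs.getFileStatus(file);
        try (FSDataInputStream in = fs.open(file)) {
          byte[] tail = new byte[8];
          in.readFully(status.getLen() - 8, tail);
          int footerLen = (tail[0] & 0xff)
              | (tail[1] & 0xff) << 8
              | (tail[2] & 0xff) << 16
              | (tail[3] & 0xff) << 24;
          return status.getLen() - 8 - footerLen;
        }
      }
    }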

There is also the possibility that parquet files could be deleted and
rewritten under the same filenames, but this isn't common in any
hadoop/spark ecosystem projects I know of; they all generate unique
filenames using application IDs or GUIDs.

Jason Altekruse

On Tue, Sep 29, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I don't remember deprecating it, but I've always recommended against using
> it because of the assumptions it requires.
>
> Those assumptions are routinely violated by processing engines and users
> that expect to be able to drop files into directories and see the results
> in their table. Since this was a feature without guard rails or
> documentation to explain how to safely use it, I think it is a good idea to
> steer people away from it and deprecate it unless someone wants to address
> those concerns. Now, I think there are much better alternatives (thanks
> Jacques!) so I probably wouldn't recommend spending time on bringing this
> up to date and making it marginally safer.
>
> On Tue, Sep 29, 2020 at 11:38 AM Julien Le Dem <julien.le...@gmail.com>
> wrote:
>
> > Hi Jason,
> > Thank you for bringing this up.
> > A correctness issue would only come up when more parquet files are
> > added to the same folder or files are modified. Historically, folders
> > have been considered immutable, and the summary file reflects the
> > metadata for all the files in the folder. The summary file contains the
> > names of the files it is for, so extra files in the folder can also be
> > detected and dealt with at read time without correctness issues.
> > Like you mentioned, the read path allows for those files to not be
> > present.
> > I think a better solution than deprecating would be to have a switch
> > that allows turning off those summary files when one wants to be able
> > to not respect the immutable folder contract.
> > Projects like Iceberg can elect to not produce them and allow modifying
> > and adding more files to the same folder without creating correctness
> > problems.
> > I would be in favor of removing those Deprecated annotations and
> > documenting the use of a switch to optionally not produce the summary
> > files when electing to modify folders.
> > I'm curious to hear about this from Ryan, who made the change in the
> > first place.
> > Best,
> > Julien
> >
> > On Fri, Sep 25, 2020 at 3:06 PM Jason Altekruse <altekruseja...@gmail.com>
> > wrote:
> >
> >> Hi Jacques,
> >>
> >> It's good to hear from you, thanks for the pointer to Iceberg. I am
> >> aware of it as well as other similar projects, including Delta Lake,
> >> which my team is already using. Unfortunately, even with Delta there is
> >> currently only a placeholder in the project where they will be tracking
> >> file level statistics at some point in the future; we are also
> >> evaluating the possibility of implementing this in Delta itself. While
> >> it and Iceberg aren't quite the same architecturally, I think there is
> >> enough overlap that it might be a bit awkward to use the two in
> >> conjunction with one another.
> >>
> >> From my testing so far, it appears that Delta can pretty easily operate
> >> alongside these older metadata summary files without the two fighting
> >> with each other. Delta is responsible for maintaining a transactionally
> >> consistent list of files, and the summary file can coexist in the
> >> directory just to allow efficient pruning on the driver side on a best
> >> effort basis, as the reader can gracefully fall back to the FS if the
> >> summary is missing a newer file.
> >>
> >> We are somewhat nervous about depending on something that is marked
> >> deprecated, but as it is so close to a "just works" state for our
> >> needs, I was hoping to confirm with the community if there were other
> >> risks I was missing.
> >>
> >> Jason Altekruse
> >>
> >> On Wed, Sep 23, 2020 at 6:29 PM Jacques Nadeau <jacq...@apache.org>
> >> wrote:
> >>
> >> > Hey Jason,
> >> >
> >> > I'd suggest you look at Apache Iceberg. It is a much more mature way
> >> > of handling metadata efficiency issues and provides a substantial
> >> > superset of functionality over the old metadata cache files.
> >> >
> >> > On Wed, Sep 23, 2020 at 4:16 PM Jason Altekruse <altekruseja...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hello again,
> >> > >
> >> > > I took a look through the mail archives and found a little more
> >> > > information in this and a few other threads.
> >> > >
> >> > >
> >> > > http://mail-archives.apache.org/mod_mbox//parquet-dev/201707.mbox/%3CCAO4re1k8-bZZZWBRuLCnm1V7AoJE1fdunSuBn%2BecRuFGPgcXnA%40mail.gmail.com%3E
> >> > >
> >> > > While I do understand the benefits of federating out the reading of
> >> > > footers, for the sake of not worrying about synchronization between
> >> > > the cached metadata and any changes to the files on disk, it does
> >> > > appear there is still a use case that isn't solved well with this
> >> > > design: needle-in-a-haystack selective filter queries, where the
> >> > > data is sorted by the filter column. For example, in the tests I ran
> >> > > with queries against lots of parquet files where the vast majority
> >> > > are pruned, doing the pruning in a bunch of small tasks takes 33
> >> > > seconds vs just 1-2 seconds with driver side pruning using the
> >> > > summary file (requires a small spark change).
> >> > >
> >> > > In our use case we are never going to be replacing the contents of
> >> > > existing parquet files (with a delete and rewrite on HDFS) or
> >> > > appending new row groups onto existing files. In that case I don't
> >> > > believe we should experience any correctness problems, but I wanted
> >> > > to confirm whether there is something I am missing. I am using
> >> > > readAllFootersInParallelUsingSummaryFiles, which does fall back to
> >> > > reading individual footers if they are missing from the summary
> >> > > file.
> >> > >
> >> > > I am also curious whether a solution to the correctness problems
> >> > > could be to include a file length and/or last modified time in the
> >> > > summary file, which could confirm against FS metadata, relatively
> >> > > quickly, that the files on disk are still in sync with the metadata
> >> > > summary. Would it be possible to consider avoiding this deprecation
> >> > > if I were to work on an update to implement this?
> >> > >
> >> > > - Jason Altekruse
> >> > >
> >> > >
> >> > > On Tue, Sep 15, 2020 at 8:52 PM Jason Altekruse <altekruseja...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hello all,
> >> > > >
> >> > > > I have been working on optimizing reads in spark to avoid
> >> > > > spinning up lots of short lived tasks that just perform row group
> >> > > > pruning in selective filter queries.
> >> > > >
> >> > > > My high level question is: why were metadata summary files marked
> >> > > > deprecated in this Parquet changeset? There isn't much explanation
> >> > > > given, or a description of what should be used instead.
> >> > > > https://github.com/apache/parquet-mr/pull/429
> >> > > >
> >> > > > There are other members of the broader parquet community who are
> >> > > > also confused by this deprecation; see this discussion in an
> >> > > > arrow PR.
> >> > > >
> >> > > > In the course of making my small prototype I got an extra
> >> > > > performance boost by making spark write out metadata summary
> >> > > > files, rather than having to read all footers on the driver. This
> >> > > > effect would be even more pronounced on a completely remote
> >> > > > storage system like S3. Writing these summary files was disabled
> >> > > > by default in SPARK-15719, because of the performance impact of
> >> > > > appending a small number of new files to an existing dataset with
> >> > > > many files.
> >> > > >
> >> > > > https://issues.apache.org/jira/browse/SPARK-15719
> >> > > >
> >> > > > This spark JIRA does make decent points considering how spark
> >> > > > operates today, but I think there is a performance optimization
> >> > > > opportunity that is missed because the row group pruning is
> >> > > > deferred to a bunch of separate short lived tasks rather than
> >> > > > done upfront; currently spark only uses footers on the driver for
> >> > > > schema merging.
> >> > > >
> >> > > > Thanks for the help!
> >> > > > Jason Altekruse
> >> > > >
> >> > >
> >> >
> >>
> >
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
