Hi,

Thanks for the explanation, Ryan. Considering the problems and alternatives
you listed, plus the total lack of documentation, I wonder whether summary
files have ever been put to use or whether they have remained an unused
feature all along. Is anyone aware of summary files being used in some system?

Thanks,

Zoltan

On Thu, Jul 27, 2017 at 7:52 PM Ryan Blue <[email protected]> wrote:

> Summary files contain merged footers of other files in the same directory.
> Sometimes that includes row group information to help plan. The idea was to
> use a single file scan instead of reading the footers of a lot of files to
> plan a job, which is expensive because you have to open all those files,
> get the footer start location, backward seek to it, etc. When job planning
> required reading all of the footers on the MR client or Spark driver, this
> helped reduce the planning time in some cases.
>
> The main problem with summary files is that they are difficult to maintain.
> If the summary file is missing data that was appended to a table later,
> then it can cause correctness problems. The long-term solution to job
> planning was to use Hadoop InputSplit planning (with no knowledge of row
> groups) and have tasks map row groups to splits on the task side (this is
> what the other formats do). That way, the work is distributed and
> everything goes much faster.
>
> There is also a second use case, which is that a summary file can provide
> the schema for a group of files. This can also get out of date and cause
> problems. The solution is to use a metastore to maintain the canonical
> schema for a table.
>
> rb
>
> On Thu, Jul 27, 2017 at 7:03 AM, Zoltan Ivanfi <[email protected]> wrote:
>
> > Hi,
> >
> > I came across some references to so-called "summary files" in
> > ParquetFileReader.java
> > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java>.
> > I wanted to find out what they are, but could hardly find any information
> > on the Internet. From the source code it seems that they replicate a
> > Parquet file's footer in a separate file, but I couldn't find them
> > mentioned in any documentation. I found this JIRA
> > <https://issues.apache.org/jira/browse/SPARK-15719> about disabling them
> > in Spark because they were not considered useful.
> >
> > Are summary files obsolete or are they still in use? What is their
> > intended use? Are they documented somewhere?
> >
> > Thanks,
> >
> > Zoltan
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
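
The per-file footer lookup Ryan describes (open the file, get the footer
start location, seek backward to it) can be sketched roughly as below. This
is only an illustration, not parquet-mr code: the function name
`locate_footer` is made up here, and the sketch assumes an unencrypted file
whose last 8 bytes are a 4-byte little-endian footer length followed by the
magic bytes PAR1. It does not parse the Thrift-encoded footer itself.

```python
import io
import struct

MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file


def locate_footer(f):
    """Return (footer_start, footer_length) for a seekable binary file.

    Hypothetical helper for illustration only.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    # Smallest possible file: leading magic + footer length + trailing magic.
    if size < 12:
        raise ValueError("too small to be a Parquet file")

    # The last 8 bytes hold the footer length and the trailing magic.
    f.seek(size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    footer_len = struct.unpack("<I", tail[:4])[0]

    # Seek backward from the tail to where the footer begins.
    footer_start = size - 8 - footer_len
    if footer_start < len(MAGIC):
        raise ValueError("corrupt footer length")
    return footer_start, footer_len


# Fake file for demonstration: magic, 10 data bytes, 5 "footer" bytes, tail.
sample = io.BytesIO(MAGIC + b"d" * 10 + b"m" * 5 + struct.pack("<I", 5) + MAGIC)
start, length = locate_footer(sample)
```

Repeating this open, seek, and read for every data file is the planning cost
that a summary file was meant to replace with a single read.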
