Summary files contain the merged footers of the other files in the same
directory. Sometimes they also include row group information to help with
job planning. The idea was to plan a job from a single file scan instead of
reading the footers of many files, which is expensive because you have to
open each file, find where its footer starts, seek backward to it, etc.
When job planning required reading all of the footers on the MR client or
Spark driver, this reduced planning time in some cases.
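
To make the per-file cost concrete, here is a rough sketch of the footer
lookup described above. It assumes the plain (unencrypted) Parquet layout,
where a file ends with a 4-byte little-endian footer length followed by the
magic bytes "PAR1". The names locate_footer and build_fake_file are made up
for this illustration and are not parquet-mr APIs.

```python
import io
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of a plain Parquet file


def locate_footer(f):
    """Return (footer_start, footer_length) for a seekable binary file."""
    # Step 1: find the file size by seeking to the end.
    f.seek(0, io.SEEK_END)
    file_size = f.tell()
    # Smallest possible file: leading magic + footer length + trailing magic.
    if file_size < 12:
        raise ValueError("file too small to be a Parquet file")

    # Step 2: read the last 8 bytes: 4-byte footer length, then the magic.
    f.seek(file_size - 8)
    tail = f.read(8)
    (footer_len,) = struct.unpack("<I", tail[:4])
    if tail[4:] != MAGIC:
        raise ValueError("missing trailing magic bytes")

    # Step 3: the footer sits right before those 8 bytes.
    footer_start = file_size - 8 - footer_len
    if footer_start < len(MAGIC):
        raise ValueError("footer length is larger than the file")
    return footer_start, footer_len


def build_fake_file(data, footer):
    """Hypothetical helper: bytes laid out like a Parquet file, fake content."""
    return MAGIC + data + footer + struct.pack("<I", len(footer)) + MAGIC


fake = io.BytesIO(build_fake_file(b"x" * 10, b"FOOTER"))
start, length = locate_footer(fake)
fake.seek(start)
footer_bytes = fake.read(length)
```

Every data file needs its own open, tail read, and backward seek like this,
which is the work a summary file was meant to replace with one scan.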

The main problem with summary files is that they are difficult to maintain.
If data is appended to a table later and the summary file is not updated to
include it, that can cause correctness problems. The long-term solution for
job planning was to use Hadoop InputSplit planning (with no knowledge of row
groups) and have each task map row groups to its split on the task side
(this is what the other formats do). That way, the footer reads are
distributed across tasks and planning is much faster.
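
As an illustration of the task-side mapping, here is a small sketch. It
assumes a midpoint rule: a task keeps a row group when the row group's
midpoint byte falls inside the task's split, so each row group is read by
exactly one task. The RowGroup and Split classes, the offsets, and
row_groups_for_split are invented for this example and are not the Hadoop
or parquet-mr types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroup:
    start: int  # byte offset of the row group within the file
    size: int   # total size of the row group in bytes


@dataclass(frozen=True)
class Split:
    start: int   # byte offset where the split begins
    length: int  # number of bytes covered by the split


def row_groups_for_split(split, row_groups):
    """Keep the row groups whose midpoint lies in [split.start, split end)."""
    end = split.start + split.length
    return [
        rg for rg in row_groups
        if split.start <= rg.start + rg.size // 2 < end
    ]


# Made-up layout: three row groups in one file, planned as 128-byte splits
# without looking at the footer.
row_groups = [RowGroup(4, 100), RowGroup(104, 100), RowGroup(204, 60)]
splits = [Split(0, 128), Split(128, 128), Split(256, 128)]

# Each task would read the footer itself and then run this for its own split.
assignment = [row_groups_for_split(s, row_groups) for s in splits]
```

With these numbers the first split gets the first row group, the second
split gets the other two, and the third split gets nothing to read.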

There is also a second use case: a summary file can provide the schema for
a group of files. That schema can also get out of date and cause problems.
The solution is to use a metastore to maintain the canonical schema for a
table.

rb

On Thu, Jul 27, 2017 at 7:03 AM, Zoltan Ivanfi <[email protected]> wrote:

> Hi,
>
> I came across some references to so-called "summary files" in
> ParquetFileReader.java
> <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java>.
> I wanted to find out what they are, but could hardly find any information
> on the Internet. From the source code it seems that they replicate a
> Parquet file's footer in a separate file, but I couldn't find them
> mentioned in any documentation. I found this JIRA
> <https://issues.apache.org/jira/browse/SPARK-15719> about disabling them
> in Spark because they were not considered useful.
>
> Are summary files obsolete or are they still in use? What is their intended
> use? Are they documented somewhere?
>
> Thanks,
>
> Zoltan
>



-- 
Ryan Blue
Software Engineer
Netflix
