They were written by default by the output format for a long time and Spark used them. Spark is now moving away from them, but people definitely use them, if only by accident.
On Tue, Aug 1, 2017 at 4:57 AM, Zoltan Ivanfi <[email protected]> wrote:

> Hi,
>
> Thanks for the explanation, Ryan. Considering the problems and
> alternatives you listed plus the total lack of documentation, I wonder
> whether summary files have ever been put into use or did they remain an
> unused feature ever since? Is anyone aware of summary files being used in
> some system?
>
> Thanks,
>
> Zoltan
>
> On Thu, Jul 27, 2017 at 7:52 PM Ryan Blue <[email protected]> wrote:
>
>> Summary files contain merged footers of other files in the same directory.
>> Sometimes that includes row group information to help plan. The idea was to
>> use a single file scan instead of reading the footers of a lot of files to
>> plan a job, which is expensive because you have to open all those files,
>> get the footer start location, backward seek to it, etc. When job planning
>> required reading all of the footers on the MR client or Spark driver, this
>> helped reduce the planning time in some cases.
>>
>> The main problem with summary files is that they are difficult to maintain.
>> If the summary file is missing data that was appended to a table later,
>> then it can cause correctness problems. The long-term solution to job
>> planning was to use Hadoop InputSplit planning (with no knowledge of row
>> groups) and have tasks map row groups to splits on the task side (this is
>> what the other formats do). That way, the work is distributed and
>> everything goes much faster.
>>
>> There is also a second use case, which is that a summary file can provide
>> the schema for a group of files. This can also get out of date and cause
>> problems. The solution is to use a metastore to maintain the canonical
>> schema for a table.
>>
>> rb
>>
>> On Thu, Jul 27, 2017 at 7:03 AM, Zoltan Ivanfi <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I came across some references to so-called "summary files" in
>> > ParquetFileReader.java
>> > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java>.
>> > I wanted to find out what they are, but could hardly find any information
>> > on the Internet. From the source code it seems that they replicate a
>> > Parquet file's footer in a separate file, but I couldn't find them
>> > mentioned in any documentation. I found this JIRA
>> > <https://issues.apache.org/jira/browse/SPARK-15719> about disabling them
>> > in Spark because they were not considered useful.
>> >
>> > Are summary files obsolete or are they still in use? What is their
>> > intended use? Are they documented somewhere?
>> >
>> > Thanks,
>> >
>> > Zoltan
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Ryan Blue
Software Engineer
Netflix
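[Editor's note] For context on the footer-seek cost Ryan describes: a Parquet file ends with the footer bytes, then a 4-byte little-endian footer length, then the magic bytes "PAR1", so a reader must open each file, read its tail, and seek backward to where the footer starts. A minimal sketch of that tail parsing follows; the `FooterLocator` class and the fabricated tail bytes are illustrative only and are not part of parquet-mr:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FooterLocator {
    // A Parquet file ends with:
    //   [footer bytes][4-byte little-endian footer length]["PAR1"]
    // Given the last bytes of the file and the total file length,
    // compute the offset at which the footer starts.
    static long footerStart(byte[] fileTail, long fileLength) {
        int n = fileTail.length;
        // Verify the trailing magic "PAR1".
        if (fileTail[n - 4] != 'P' || fileTail[n - 3] != 'A'
                || fileTail[n - 2] != 'R' || fileTail[n - 1] != '1') {
            throw new IllegalArgumentException("not a Parquet file");
        }
        // The 4 bytes just before the magic hold the footer length.
        int footerLen = ByteBuffer.wrap(fileTail, n - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN).getInt();
        // The footer ends where the length field begins.
        return fileLength - 8 - footerLen;
    }

    public static void main(String[] args) {
        // Fabricated tail: a 10-byte footer, its length field, then "PAR1".
        byte[] tail = new byte[18];
        ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(10);
        buf.putInt(10);
        buf.put((byte) 'P').put((byte) 'A').put((byte) 'R').put((byte) '1');
        long fileLength = 1000L; // pretend total file size
        System.out.println(footerStart(tail, fileLength)); // prints 982
    }
}
```

This is why planning across many files required opening every one of them: the footer offset is only discoverable from the end of each file, and a summary file avoided that per-file round trip by merging all the footers into one place.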
