They were written by default by the output format for a long time and Spark used them. Spark is now moving away from them, but people definitely use them, if only by accident.
On Tue, Aug 1, 2017 at 4:57 AM, Zoltan Ivanfi <[email protected]> wrote:

> Hi,
>
> Thanks for the explanation, Ryan. Considering the problems and
> alternatives you listed plus the total lack of documentation, I wonder
> whether summary files have ever been put into use or did they remain an
> unused feature ever since? Is anyone aware of summary files being used in
> some system?
>
> Thanks,
>
> Zoltan
>
> On Thu, Jul 27, 2017 at 7:52 PM Ryan Blue <[email protected]> wrote:
>
>> Summary files contain merged footers of other files in the same directory.
>> Sometimes that includes row group information to help plan. The idea was to
>> use a single file scan instead of reading the footers of a lot of files to
>> plan a job, which is expensive because you have to open all those files,
>> get the footer start location, backward seek to it, etc. When job planning
>> required reading all of the footers on the MR client or Spark driver, this
>> helped reduce the planning time in some cases.
>>
>> The main problem with summary files is that they are difficult to maintain.
>> If the summary file is missing data that was appended to a table later,
>> then it can cause correctness problems. The long-term solution to job
>> planning was to use Hadoop InputSplit planning (with no knowledge of row
>> groups) and have tasks map row groups to splits on the task side (this is
>> what the other formats do). That way, the work is distributed and
>> everything goes much faster.
>>
>> There is also a second use case, which is that a summary file can provide
>> the schema for a group of files. This can also get out of date and cause
>> problems. The solution is to use a metastore to maintain the canonical
>> schema for a table.
>>
>> rb
>>
>> On Thu, Jul 27, 2017 at 7:03 AM, Zoltan Ivanfi <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I came across some references to so-called "summary files" in
>> > ParquetFileReader.java
>> > <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java>.
>> > I wanted to find out what they are, but could hardly find any information
>> > on the Internet. From the source code it seems that they replicate a
>> > Parquet file's footer in a separate file, but I couldn't find them
>> > mentioned in any documentation. I found this JIRA
>> > <https://issues.apache.org/jira/browse/SPARK-15719> about disabling them
>> > in Spark because they were not considered useful.
>> >
>> > Are summary files obsolete or are they still in use? What is their
>> > intended use? Are they documented somewhere?
>> >
>> > Thanks,
>> >
>> > Zoltan
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Ryan Blue
Software Engineer
Netflix
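[Editor's note] For context on the footer-seek cost Ryan describes: a Parquet file ends with the footer bytes, then a 4-byte little-endian footer length, then the magic bytes "PAR1", so a reader must open each file, read its tail, and seek backward to where the footer starts. A minimal sketch of that tail parsing follows; the `FooterLocator` class and the fabricated tail bytes are illustrative only and are not part of parquet-mr:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class FooterLocator {
    // A Parquet file ends with:
    //   [footer bytes][4-byte little-endian footer length]["PAR1"]
    // Given the last bytes of the file and the total file length,
    // compute the offset at which the footer starts.
    static long footerStart(byte[] fileTail, long fileLength) {
        int n = fileTail.length;
        // Verify the trailing magic "PAR1".
        if (fileTail[n - 4] != 'P' || fileTail[n - 3] != 'A'
                || fileTail[n - 2] != 'R' || fileTail[n - 1] != '1') {
            throw new IllegalArgumentException("not a Parquet file");
        }
        // The 4 bytes just before the magic hold the footer length.
        int footerLen = ByteBuffer.wrap(fileTail, n - 8, 4)
                .order(ByteOrder.LITTLE_ENDIAN).getInt();
        // The footer ends where the length field begins.
        return fileLength - 8 - footerLen;
    }

    public static void main(String[] args) {
        // Fabricated tail: a 10-byte footer, its length field, then "PAR1".
        byte[] tail = new byte[18];
        ByteBuffer buf = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(10);
        buf.putInt(10);
        buf.put((byte) 'P').put((byte) 'A').put((byte) 'R').put((byte) '1');
        long fileLength = 1000L; // pretend total file size
        System.out.println(footerStart(tail, fileLength)); // prints 982
    }
}
```

This is why planning across many files required opening every one of them: the footer offset is only discoverable from the end of each file, and a summary file avoided that per-file round trip by merging all the footers into one place.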
