On 07/20/2015 10:54 AM, Alex Levenson wrote:
You can push the reading of the summary file to the mappers instead of
reading it on the submitter node:

ParquetInputFormat.setTaskSideMetaData(conf, true);

This is the default from 1.6.0 forward.

Alternatively, set "parquet.task.side.metadata" to true in your
configuration. We had a similar issue: by default the client reads the
summary file on the submitter node, which takes a lot of time and memory.
This flag fixed the issue for us by instead reading each individual file's
metadata from the file footer in the mappers (each mapper reads only the
metadata it needs).

Another option, which we've talked about in the past, is to disable
creating this metadata file altogether. We've seen that creating it can be
expensive too, and if you use the task side metadata approach, it's never
used.

There's an option to suppress the files, which I recommend. Now that file metadata is handled on the tasks, there's not much need for the summary files.
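
As a rough sketch of how the two settings discussed above could be passed to a job: the helper class below is hypothetical (it is not part of Parquet or Hadoop) and only builds the key/value pairs as "-D key=value" arguments, which assumes the job driver accepts Hadoop generic options (for example via ToolRunner). The key "parquet.task.side.metadata" comes from this thread; "parquet.enable.summary-metadata" is my assumption for the summary-file switch, so check it against the parquet-mr version you run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Parquet or Hadoop.
final class ParquetMetadataSettings {

    // Named in this thread: read file footers in the tasks, not on the submitter node.
    static final String TASK_SIDE_METADATA = "parquet.task.side.metadata";

    // Assumption: key that controls writing of summary files; verify for your version.
    static final String ENABLE_SUMMARY_METADATA = "parquet.enable.summary-metadata";

    // Returns the settings in insertion order: task-side metadata on, summary files off.
    static Map<String, String> properties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(TASK_SIDE_METADATA, "true");
        props.put(ENABLE_SUMMARY_METADATA, "false");
        return props;
    }

    // Renders each property as two tokens, "-D" followed by "key=value".
    static List<String> toGenericOptions(Map<String, String> props) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> entry : props.entrySet()) {
            args.add("-D");
            args.add(entry.getKey() + "=" + entry.getValue());
        }
        return args;
    }

    public static void main(String[] argv) {
        // Prints the arguments on one line so they can be pasted into a job command.
        System.out.println(String.join(" ", toGenericOptions(properties())));
    }
}
```

The same two keys could instead be set directly on the job's configuration object; the command-line form is used here only to keep the sketch free of Hadoop dependencies.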

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.
