You can push the reading of the summary file to the mappers instead of
reading it on the submitter node:

ParquetInputFormat.setTaskSideMetaData(conf, true);

or by setting "parquet.task.side.metadata" to true in your configuration. We
had a similar issue: by default the client reads the summary file on the
submitter node, which takes a lot of time and memory. This flag fixed the
issue for us by instead reading each individual file's metadata from the
file footer in the mappers (each mapper reads only the metadata it needs).
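For reference, here's a sketch of what that property looks like if you set it in your job's configuration file rather than in code (the property name comes from the flag above; where exactly you set it depends on how your jobs are configured):

```xml
<!-- Read Parquet metadata from each file's footer in the tasks,
     instead of reading the _metadata summary file on the submitter node. -->
<property>
  <name>parquet.task.side.metadata</name>
  <value>true</value>
</property>
```

If your job goes through Hadoop's Tool/GenericOptionsParser, you can also pass it on the command line as -D parquet.task.side.metadata=true.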

Another option, which we've discussed in the past, is to disable creating
this metadata file entirely; we've seen that creating it can be expensive
too, and if you use the task-side metadata approach, it's never used.
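For what it's worth, some parquet-mr versions expose a write-time switch for this. Treat the property name below as an assumption and check the ParquetOutputFormat shipped in your release before relying on it:

```java
// Assumption: available in newer parquet-mr releases as
// ParquetOutputFormat.ENABLE_JOB_SUMMARY -- verify against your version.
// When false, the write job skips creating the _metadata summary file.
conf.setBoolean("parquet.enable.summary-metadata", false);
```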

On Mon, Jul 20, 2015 at 10:40 AM, Yan Qi <[email protected]> wrote:

> Right now we are using a MapReduce job to convert some data and store the
> result in the Parquet format. The size can be tens of terabytes, leading to
> a pretty large summary file (i.e., _metadata).
>
> When we try to use another MapReduce job to read the result, it takes
> forever to load the metadata.
>
> We are wondering if it is possible to reduce (ideally eliminate) the cost
> of loading the summary file while starting a MR job.
>
>
> Thanks,
>
> Yan
>



-- 
Alex Levenson
@THISWILLWORK
