Ryan Blue created PARQUET-139:
---------------------------------
Summary: Avoid reading file footers in parquet-avro InputFormat
Key: PARQUET-139
URL: https://issues.apache.org/jira/browse/PARQUET-139
Project: Parquet
Issue Type: Task
Reporter: Ryan Blue
AvroParquetInputFormat currently relies on ParquetInputFormat, which reads the
footers of all of the files that will be processed. It does this for two
reasons:
1. To plan splits (if using client-side splits)
2. To get a merged schema for all of the files
Reading all of the footers is a bottleneck when working with a large number of
files and can significantly delay a job, because only one machine (the client)
is working. The footers should instead be read in parallel on the task side.
PARQUET-84 added the ability to avoid reading footers on the client for split
planning, so the remaining difficulty is avoiding the footer reads that are
used to merge the Parquet schema.
To avoid merging the Parquet schema, the AvroParquetInputFormat should either
use whatever schema each file contains or reconcile the projection schema with
the file schema on the task side.
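
The second option, task-side reconciliation, could look roughly like the
sketch below. It is an illustration under stated assumptions, not the
parquet-avro implementation: schemas are modeled as plain ordered maps of
column name to type name rather than Parquet or Avro schema objects, and the
class ProjectionReconciler, its Result type, and the column names are all
hypothetical. Each task would run this against the footer of its own file, so
no merged schema is needed on the client.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a schema is an ordered map of
// column name -> type name, not a Parquet MessageType or an Avro Schema.
public class ProjectionReconciler {

    /** Outcome of reconciling a projection against one file's schema. */
    public static final class Result {
        // Projected columns that exist in the file and should be read.
        public final Map<String, String> columnsToRead = new LinkedHashMap<>();
        // Projected columns that the file does not contain.
        public final List<String> missingColumns = new ArrayList<>();
    }

    /**
     * Reconciles the requested projection with the schema of a single file.
     * Columns present in both are read, columns absent from the file are
     * reported as missing, and a type conflict is treated as an error.
     */
    public static Result reconcile(Map<String, String> projection,
                                   Map<String, String> fileSchema) {
        Result result = new Result();
        for (Map.Entry<String, String> field : projection.entrySet()) {
            String name = field.getKey();
            String fileType = fileSchema.get(name);
            if (fileType == null) {
                result.missingColumns.add(name);
            } else if (fileType.equals(field.getValue())) {
                result.columnsToRead.put(name, fileType);
            } else {
                throw new IllegalArgumentException(
                    "Column " + name + " is " + fileType
                    + " in the file but " + field.getValue()
                    + " in the projection");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Projection requested by the job (assumed column names and types).
        Map<String, String> projection = new LinkedHashMap<>();
        projection.put("id", "int64");
        projection.put("email", "binary");

        // Schema found in the footer of the one file this task reads.
        Map<String, String> fileSchema = new LinkedHashMap<>();
        fileSchema.put("id", "int64");
        fileSchema.put("name", "binary");

        Result result = reconcile(projection, fileSchema);
        System.out.println("read:    " + result.columnsToRead);
        System.out.println("missing: " + result.missingColumns);
    }
}
```

How missing columns are handled (filled with nulls or defaults, or rejected)
is a design choice this sketch leaves open by only reporting them.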
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)