GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/1370
[SPARK-2119][SQL] Improved Parquet performance when reading off S3
JIRA issue: [SPARK-2119](https://issues.apache.org/jira/browse/SPARK-2119)
Essentially, this PR fixed three issues to gain much better performance when
reading large Parquet files off S3.
1. When reading the schema, fetch Parquet metadata from a part-file
rather than the `_metadata` file
The `_metadata` file contains the metadata of all row groups, and can be
very large if there are many row groups. Since schema information and row group
metadata are coupled within a single Thrift object, we have to read the whole
`_metadata` file to fetch the schema. On the other hand, the schema is replicated
in the footers of all part-files, which are fairly small.
1. Only add the root directory of the Parquet file rather than all the
part-files to input paths
The HDFS API can automatically filter out all hidden files and underscore
files (`_SUCCESS` & `_metadata`), so there's no need to pick out the part-files
and add them individually to input paths. What makes it much worse is that
`FileInputFormat.listStatus()` calls `FileSystem.globStatus()` on each
individual input path sequentially, and each call results in a blocking remote
S3 HTTP request.
1. Worked around
[PARQUET-16](https://issues.apache.org/jira/browse/PARQUET-16)
Essentially, PARQUET-16 is similar to the above issue, and results in lots
of sequential `FileSystem.getFileStatus()` calls, which are further translated
into a bunch of remote S3 HTTP requests.
`FilteringParquetRowInputFormat` should be cleaned up once PARQUET-16 is
fixed.
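To make the cost difference in the second item concrete, here is a rough toy model in Python. It is not Spark or Hadoop code: `CountingStore`, `inputs_per_part_file`, `inputs_from_root`, and the part-file names are all hypothetical, and a single directory listing is assumed to cost one round trip (a real S3 listing may be paginated). It only counts simulated requests for the two ways of building the input paths.

```python
class CountingStore:
    """Toy stand-in for a remote object store; counts simulated round trips."""

    def __init__(self, names):
        self.names = list(names)
        self.requests = 0

    def list_dir(self):
        # Assumption: one round trip returns every entry under the root directory.
        self.requests += 1
        return list(self.names)

    def stat(self, name):
        # One round trip per individual path.
        self.requests += 1
        return name if name in self.names else None


def is_hidden(name):
    # Skip hidden files and underscore files such as _SUCCESS and _metadata.
    return name.startswith("_") or name.startswith(".")


def inputs_per_part_file(store, part_files):
    # Strategy A: every part-file is its own input path, resolved one by one.
    return [store.stat(p) for p in part_files]


def inputs_from_root(store):
    # Strategy B: only the root directory is an input path; filter the listing once.
    return [n for n in store.list_dir() if not is_hidden(n)]


# Hypothetical directory layout: two underscore files plus 200 part-files.
names = ["_SUCCESS", "_metadata"] + [f"part-r-{i:05d}.parquet" for i in range(200)]
parts = [n for n in names if not is_hidden(n)]

store_a = CountingStore(names)
files_a = inputs_per_part_file(store_a, parts)

store_b = CountingStore(names)
files_b = inputs_from_root(store_b)

print(store_a.requests, store_b.requests)  # 200 1
```

Both strategies end up with the same set of part-files; in this model only the number of sequential, blocking requests differs.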
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark faster-parquet
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1370.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1370
----
commit 5bd3d29f9fa118719c94d1f5acffa24d6f1a755d
Author: Cheng Lian <[email protected]>
Date: 2014-07-06T04:59:01Z
Fixed Parquet log level
commit 1c0d1b923a57fddd1fe67270c71e28ac0324de04
Author: Cheng Lian <[email protected]>
Date: 2014-07-09T01:53:38Z
Accelerated Parquet schema retrieving
commit d2c4417a45dff48ad52a830695f9d68f9ed8531f
Author: Cheng Lian <[email protected]>
Date: 2014-07-10T20:17:57Z
Worked around PARQUET-16 to improve Parquet performance
----