Cheng Lian created PARQUET-16:
---------------------------------
Summary: Unnecessary getFileStatus() calls on all part-files in
ParquetInputFormat.getSplits
Key: PARQUET-16
URL: https://issues.apache.org/jira/browse/PARQUET-16
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Reporter: Cheng Lian
When testing Spark SQL Parquet support, we found that accessing large Parquet
files located in S3 can be very slow. To be more specific, we have an S3 Parquet
file with over 3,000 part-files, and calling {{ParquetInputFormat.getSplits}} on
it takes several minutes. (We were accessing this file from our office network
rather than from within AWS.)
After some investigation, we found that {{ParquetInputFormat.getSplits}} calls
{{getFileStatus()}} to get the {{FileStatus}} object of every part-file, one at
a time
([here|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.5.0/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java#L370]).
In the case of S3, each {{getFileStatus()}} call issues an HTTP request and
blocks until the reply arrives, which is considerably expensive.
Actually, all these {{FileStatus}} objects have already been fetched by the time
the footers are retrieved
([here|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.5.0/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java#L443]).
Caching these {{FileStatus}} objects greatly improves our S3 case (the time
dropped from over 5 minutes to about 1.4 minutes).
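
The caching idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: {{Status}} and {{StatusCache}} are hypothetical stand-ins (not Hadoop or parquet-mr classes), and the remote lookup is modelled as a plain function so the sketch runs without a file system.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for Hadoop's FileStatus; only what the sketch needs.
final class Status {
    final String path;
    final long length;

    Status(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

// Hypothetical helper: remembers statuses that were already fetched so that
// later lookups do not trigger another remote (e.g. S3 HTTP) request.
final class StatusCache {
    private final Map<String, Status> byPath = new HashMap<>();
    private final Function<String, Status> remoteLookup;
    private int remoteCalls = 0;

    StatusCache(Function<String, Status> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    // Record statuses obtained earlier, e.g. while listing files for footers.
    void remember(List<Status> statuses) {
        for (Status s : statuses) {
            byPath.put(s.path, s);
        }
    }

    // Return the cached status; fall back to the remote lookup only on a miss.
    Status get(String path) {
        Status cached = byPath.get(path);
        if (cached != null) {
            return cached;
        }
        remoteCalls++;
        Status fetched = remoteLookup.apply(path);
        byPath.put(path, fetched);
        return fetched;
    }

    // Number of lookups that actually went to the remote side.
    int remoteCalls() {
        return remoteCalls;
    }
}
```

With the statuses from the footer listing passed to {{remember}}, split computation would call {{get}} per part-file and go to the remote side only for paths that were not seen before.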
I will submit a PR for this issue soon.