[ 
https://issues.apache.org/jira/browse/PARQUET-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated PARQUET-16:
------------------------------

    Description: 
When testing Spark SQL Parquet support, we found that accessing large Parquet 
files located in S3 can be very slow. To be more specific, we have an S3 Parquet 
file with over 3,000 part-files, and calling {{ParquetInputFormat.getSplits}} on 
it takes several minutes. (We were accessing this file from our office network 
rather than from within AWS.)

After some investigation, we found that {{ParquetInputFormat.getSplits}} calls 
{{getFileStatus()}} on all part-files one by one, sequentially 
([here|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.5.0/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java#L370]).
 In the case of S3, each {{getFileStatus()}} call issues an HTTP request and 
waits for the reply in a blocking manner, which is considerably expensive.

All of these {{FileStatus}} objects have already been fetched by the time the 
footers are retrieved 
([here|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.5.0/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java#L443]).
 Caching these {{FileStatus}} objects greatly improves our S3 case (the time 
dropped from over 5 minutes to about 1.4 minutes).
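The caching idea can be sketched as follows. This is a toy simulation, not the actual parquet-mr patch: {{FakeFileSystem}} and {{FakeStatus}} are hypothetical stand-ins for Hadoop's {{FileSystem}}/{{FileStatus}}, and the remote-call counter models the per-file blocking S3 round-trip. The point is that the status objects touched during footer retrieval are kept in a map, so the split-computation phase issues no further remote calls.

```java
import java.util.HashMap;
import java.util.Map;

public class FileStatusCacheSketch {
    // Hypothetical stand-in for Hadoop's FileStatus.
    static class FakeStatus {
        final String path;
        final long length;
        FakeStatus(String path, long length) { this.path = path; this.length = length; }
    }

    // Hypothetical stand-in for a FileSystem backed by S3.
    static class FakeFileSystem {
        int remoteCalls = 0; // counts simulated blocking HTTP round-trips

        FakeStatus getFileStatus(String path) {
            remoteCalls++; // each call would block on one S3 HTTP request
            return new FakeStatus(path, 1024);
        }
    }

    public static void main(String[] args) {
        FakeFileSystem fs = new FakeFileSystem();
        Map<String, FakeStatus> cache = new HashMap<>();

        // Phase 1: footer retrieval already touches every part-file once;
        // store each status instead of discarding it.
        for (int i = 0; i < 3000; i++) {
            String path = "s3://bucket/table/part-" + i;
            cache.put(path, fs.getFileStatus(path));
        }

        // Phase 2: split computation reuses the cached statuses and
        // issues no further remote calls.
        long totalLength = 0;
        for (FakeStatus status : cache.values()) {
            totalLength += status.length;
        }

        System.out.println("remoteCalls=" + fs.remoteCalls);
        System.out.println("totalLength=" + totalLength);
    }
}
```

Without the cache, phase 2 would repeat one blocking lookup per part-file, doubling the remote-call count for 3,000 files.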

Will submit a PR for this issue soon.

  was:
When testing Spark SQL Parquet support, we found that accessing large Parquet 
files located in S3 can be very slow. To be more specific, we have a S3 Parquet 
file with over 3,000 part-files, calling {{ParquetInputFormat.getSplits}} on it 
takes several minutes. (We were accessing this file from our office network 
rather than AWS.)

After some investigation, we found that {{ParquetInputFormat.getSplits}} is 
trying to call {{getFileStatus()}} to get the {{FileStatus}} object of all 
part-files one by one sequentially 
([here|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.5.0/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java#L370]).
 And in the case of S3, each {{getFileStatus()}} call issues an HTTP request 
and wait for the reply in a blocking manner, which is considerably expensive.

Actually all these {{FileStatus}} objects have already been fetched when 
footers are retrieved 
([here|https://github.com/apache/incubator-parquet-mr/blob/parquet-1.5.0/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java#L443]).
 Caching these {{FileStatus}} objects can greatly improve our S3 case (reduced 
from over 5 minutes to about 1.4 minutes).

Will submit a PR for this issue soon.


> Unnecessary getFileStatus() calls on all part-files in 
> ParquetInputFormat.getSplits
> -----------------------------------------------------------------------------------
>
>                 Key: PARQUET-16
>                 URL: https://issues.apache.org/jira/browse/PARQUET-16
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Cheng Lian
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
