[
https://issues.apache.org/jira/browse/DRILL-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Uwe L. Korn updated DRILL-4976:
-------------------------------
Summary: Querying Parquet files on S3 pulls too much data  (was: Querying Parquet files on S3 pulls )
> Querying Parquet files on S3 pulls too much data
> -------------------------------------------------
>
> Key: DRILL-4976
> URL: https://issues.apache.org/jira/browse/DRILL-4976
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Parquet
> Affects Versions: 1.8.0
> Reporter: Uwe L. Korn
>
> Currently (Drill 1.8, Hadoop 2.7.2), when queries are executed on files stored
> in S3, the underlying s3a implementation requests orders of magnitude too much
> data. Given sufficient seek sizes, the following HTTP pattern is observed:
> * GET bytes=8k-100M
> * GET bytes=2M-100M
> * GET bytes=4M-100M
> Although the HTTP requests were normally aborted before all the data was
> sent by the server, the amount that went over the network was still about
> 10-15x the size of the input files, i.e. for a file of 100M, sometimes 1G
> of data was transferred over the network.
> A fix for this is the new {{fs.s3a.experimental.input.fadvise=random}}
> mode, which will be introduced with Hadoop 3.
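> Once a Hadoop version shipping that option is available, enabling it should be a
> matter of a {{core-site.xml}} entry (a sketch, assuming the property name cited
> above; the value {{random}} favors positioned reads over sequential prefetching):
> {code:xml}
> <property>
>   <name>fs.s3a.experimental.input.fadvise</name>
>   <value>random</value>
>   <description>Optimize S3A input streams for random IO, e.g. Parquet
>   footer and column-chunk seeks, instead of sequential whole-object
>   reads.</description>
> </property>
> {code}
> With random fadvise, each seek+read issues a bounded ranged GET for roughly the
> requested bytes rather than an open-ended GET to the end of the object, which is
> what causes the over-fetch pattern above.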
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)