Uwe L. Korn created DRILL-4976:
----------------------------------
Summary: Querying Parquet files on S3 pulls 10-15x the file size over the network
Key: DRILL-4976
URL: https://issues.apache.org/jira/browse/DRILL-4976
Project: Apache Drill
Issue Type: Improvement
Components: Storage - Parquet
Affects Versions: 1.8.0
Reporter: Uwe L. Korn
Currently (Drill 1.8, Hadoop 2.7.2), when queries are executed on files stored
in S3, the underlying s3a implementation requests orders of magnitude more data
than needed. Given sufficiently large seeks, the following HTTP pattern is observed:
* GET bytes=8k-100M
* GET bytes=2M-100M
* GET bytes=4M-100M
Although the HTTP requests are normally aborted before all the data is
sent by the server, about 10-15x the size of the input file still goes
over the network, i.e. for a file 100M in size, sometimes 1G of data is
transferred.
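The over-read can be sketched with a small model of the two s3a fadvise policies. This is a hypothetical simplification, not the actual S3AInputStream code: the assumption is that in sequential mode each reopen after a seek issues a ranged GET from the current position to the end of the object, while random mode requests only the bytes actually read. The read offsets and lengths below are illustrative.

```java
// Hypothetical model of s3a ranged GETs (not the real S3AInputStream logic).
// Sequential fadvise: each reopen requests bytes [pos, contentLength).
// Random fadvise: each reopen requests only bytes [pos, pos + length).
public class S3aFadviseModel {

    /** Total bytes requested over HTTP for a list of {offset, length} reads. */
    static long requestedBytes(long contentLength, long[][] reads, boolean random) {
        long total = 0;
        for (long[] r : reads) {
            long pos = r[0], len = r[1];
            total += random ? len : (contentLength - pos);
        }
        return total;
    }

    public static void main(String[] args) {
        long fileSize = 100L * 1024 * 1024;   // a 100M Parquet file
        // Three assumed column-chunk reads of 1M each at increasing offsets,
        // mirroring the GET pattern above (8k, 2M, 4M).
        long[][] reads = {
            {8 * 1024, 1 << 20},
            {2L << 20, 1 << 20},
            {4L << 20, 1 << 20},
        };
        System.out.println("sequential: " + requestedBytes(fileSize, reads, false));
        System.out.println("random:     " + requestedBytes(fileSize, reads, true));
    }
}
```

With these three reads, the sequential policy requests roughly 3x the whole file even though only 3M of data is consumed, which is the mismatch the issue describes.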
A fix for this is the new {{fs.s3a.experimental.input.fadvise=random}} mode,
which will be introduced with Hadoop 3.
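On a Hadoop version that supports the property, it would be set like any other s3a option in {{core-site.xml}} (sketch, assuming the property name above ships unchanged):

```xml
<property>
  <name>fs.s3a.experimental.input.fadvise</name>
  <value>random</value>
</property>
```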
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)