Sahil Takiar created IMPALA-8523:
------------------------------------
Summary: Migrate hdfsOpen to builder-based openFile API
Key: IMPALA-8523
URL: https://issues.apache.org/jira/browse/IMPALA-8523
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Sahil Takiar
Assignee: Sahil Takiar
When opening files via libhdfs we call {{hdfsOpen}}, which ultimately calls
{{FileSystem#open(Path f, int bufferSize)}}. As of HADOOP-15229, the
HDFS client exposes a new API for opening files called {{openFile}}. The
new API has two advantages: (1) it can specify file-specific configuration
values in a builder-based manner (see {{o.a.h.fs.FSBuilder}} for details), and
(2) it can open files asynchronously (see
{{o.a.h.fs.FutureDataInputStreamBuilder}} for details).
The async file opens are similar to IMPALA-7738 (Implement timeouts for HDFS
open calls). To avoid overlap between IMPALA-7738 and the async file opens in
{{openFile}}, HADOOP-15691 can be used to check which filesystems open files
asynchronously and which ones don't (currently only S3A opens files
asynchronously).
The main use case for the new {{openFile}} API is Impala-S3 performance.
Performance benchmarks have shown that setting
{{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} for Parquet files can
significantly improve performance; however, this setting also adversely affects
scans of non-splittable file formats such as gzipped files (see HADOOP-13203).
One solution is simply to document that setting
{{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} improves performance for
Parquet. A better solution would be to use the new {{openFile}} API to specify
different fadvise values depending on the file type.
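
As a rough illustration of the per-file-type idea, here is a minimal Java sketch
that picks a fadvise value from the file name. The class {{FadvisePolicy}}, the
method {{forPath}}, the extension list, and the example paths are all
hypothetical and are not Impala or Hadoop code. The policy names ({{normal}},
{{sequential}}, {{random}}) are the values the S3A documentation lists for this
option, and the fallback to {{normal}} is an assumption.

```java
import java.util.Locale;

/** Hypothetical helper: chooses an S3A fadvise value from a file name. */
final class FadvisePolicy {
    /** Option key discussed in this issue. */
    static final String FADVISE_KEY = "fs.s3a.experimental.input.fadvise";

    private FadvisePolicy() {}

    /**
     * Returns the fadvise value to request for the given path.
     * The extension-to-policy mapping below is an assumption made
     * for illustration, not an exhaustive or authoritative list.
     */
    static String forPath(String path) {
        String name = path.toLowerCase(Locale.ROOT);
        if (name.endsWith(".parquet") || name.endsWith(".parq")) {
            return "random";      // columnar format, read with many seeks
        }
        if (name.endsWith(".gz")) {
            return "sequential";  // non-splittable, read front to back
        }
        return "normal";          // assumed fallback for everything else
    }

    public static void main(String[] args) {
        String[] paths = {
            "s3a://bucket/tbl/part-0.parquet",
            "s3a://bucket/logs/day.csv.gz",
            "s3a://bucket/tbl/data.txt",
        };
        for (String p : paths) {
            System.out.println(p + " -> " + FADVISE_KEY + "=" + forPath(p));
        }
    }
}
```

On the Java side, the chosen value would presumably be passed per file as
{{fs.openFile(path).opt(FadvisePolicy.FADVISE_KEY, policy).build()}}, which
returns a {{CompletableFuture<FSDataInputStream>}}. How this maps onto libhdfs
depends on the API that HDFS-14478 exposes.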
This work depends on exposing the new {{openFile}} API via libhdfs
(HDFS-14478).