[
https://issues.apache.org/jira/browse/IMPALA-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17979263#comment-17979263
]
Steve Loughran commented on IMPALA-8523:
----------------------------------------
In the latest versions, callers are encouraged to declare the file format as a
list, e.g. "parquet, vectored, random, adaptive"; the first one used is picked
up. If Impala doesn't use adaptive, cut that out.
The new analytics stream will use the "parquet" tag to read the footer, cache
its structure and potentially prefetch rowgroups based on ongoing reads.
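
A minimal sketch of the "first one used is picked up" behaviour described above. This is not Hadoop code: the class name, the helper firstSupported and the set of supported policies are hypothetical and exist only for illustration. The policy string is the one quoted in this comment.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.Set;

class ReadPolicySelection {
    // Hypothetical helper: split a comma-separated policy list and return the
    // first entry that appears in the (assumed) set of supported policies.
    static Optional<String> firstSupported(String policyList, Set<String> supported) {
        return Arrays.stream(policyList.split(","))
                .map(String::trim)
                .filter(p -> !p.isEmpty())
                .filter(supported::contains)
                .findFirst();
    }

    public static void main(String[] args) {
        String declared = "parquet, vectored, random, adaptive";
        // Assumption: a store that only understands "random" and "adaptive".
        System.out.println(firstSupported(declared, Set.of("random", "adaptive")));
        // Assumption: a store that understands none of the declared policies.
        System.out.println(firstSupported(declared, Set.of("sequential")));
    }
}
```

The order of the list expresses preference: more specific tags go first, generic fallbacks last, so a store that does not know "parquet" can still fall back to "random".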
> Migrate hdfsOpen to builder-based openFile API
> ----------------------------------------------
>
> Key: IMPALA-8523
> URL: https://issues.apache.org/jira/browse/IMPALA-8523
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Sahil Takiar
> Priority: Major
>
> When opening files via libhdfs we call {{hdfsOpen}} which ultimately calls
> {{FileSystem#open(Path f, int bufferSize)}}. As of HADOOP-15229, the
> HDFS-client now exposes a new API for opening files called {{openFile}}. The
> new API has a few advantages: (1) it can specify file-specific
> configuration values in a builder-based manner (see {{o.a.h.fs.FSBuilder}}
> for details), and (2) it can open files asynchronously (see
> {{o.a.h.fs.FutureDataInputStreamBuilder}} for details).
> The async file opens are similar to IMPALA-7738 (Implement timeouts for HDFS
> open calls). To avoid overlap between IMPALA-7738 and the async file opens in
> {{openFile}}, HADOOP-15691 can be used to check which filesystems open files
> asynchronously and which ones don't (currently only S3A opens files
> asynchronously).
> The main use case for the new {{openFile}} API is Impala-S3 performance.
> Performance benchmarks have shown that setting
> {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} for Parquet files can
> significantly improve performance; however, this setting also adversely
> affects scans of non-splittable file formats such as gzipped files (see
> HADOOP-13203). One solution to this issue is to just document that setting
> {{fs.s3a.experimental.input.fadvise}} to {{RANDOM}} for Parquet improves
> performance; however, a better solution would be to use the new {{openFile}}
> API to specify different values of fadvise depending on the file type.
> This work is dependent on exposing the new {{openFile}} API via libhdfs
> (HDFS-14478).
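
The idea in the quoted description of choosing fadvise per file type could look roughly like the sketch below. The class name, the helper fadviseFor and the extension-to-value mapping are assumptions made for illustration, not Impala or Hadoop code; only the observation that random suits Parquet and hurts gzipped files comes from the issue text.

```java
import java.util.Locale;

class FadviseChooser {
    // Hypothetical mapping from file name to an fadvise value.
    // Assumption: the file type can be inferred from the file extension.
    static String fadviseFor(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".parquet")) {
            return "random";      // columnar format, seek-heavy reads
        }
        if (name.endsWith(".gz")) {
            return "sequential";  // non-splittable, read front to back
        }
        return "normal";          // assumed fallback for everything else
    }

    public static void main(String[] args) {
        System.out.println(fadviseFor("part-0001.parquet"));
        System.out.println(fadviseFor("logs-2019-05.csv.gz"));
        System.out.println(fadviseFor("notes.txt"));
    }
}
```

The returned value would then be supplied as a per-file option when the file is opened, instead of being set once for the whole filesystem.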