[ https://issues.apache.org/jira/browse/IMPALA-8490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841820#comment-16841820 ]
Alex Rodoni commented on IMPALA-8490:
-------------------------------------
[~stakiar] Does the sentence below also apply to S3 file handle caching?
"The feature is enabled by default with 20,000 file handles to be cached. To
change the value, set the configuration option max_cached_file_handles to a
non-zero value for each impalad daemon."
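For reference, here is a rough sketch of how the settings discussed in this issue might look as impalad startup flags. It assumes the usual --name=value flag syntax, and the values are simply the defaults mentioned in this thread, not a tuning recommendation:

```
--max_cached_file_handles=20000
--cache_remote_file_handles=true
--cache_s3_file_handles=true
```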
> Impala Doc: the file handle cache now supports S3
> -------------------------------------------------
>
> Key: IMPALA-8490
> URL: https://issues.apache.org/jira/browse/IMPALA-8490
> Project: IMPALA
> Issue Type: Sub-task
> Components: Docs
> Reporter: Sahil Takiar
> Assignee: Alex Rodoni
> Priority: Major
> Labels: future_release_doc, in_33
>
> https://impala.apache.org/docs/build/html/topics/impala_scalability.html
> state:
> {quote}
> Because this feature only involves HDFS data files, it does not apply to
> non-HDFS tables, such as Kudu or HBase tables, or tables that store their
> data on cloud services such as S3 or ADLS.
> {quote}
> This section should be updated because the file handle cache now supports S3
> files.
> We should add a section to the docs similar to what we added when support for
> remote HDFS files was added to the file handle cache:
> {quote}
> In Impala 3.2 and higher, file handle caching also applies to remote HDFS
> file handles. This is controlled by the cache_remote_file_handles flag for an
> impalad. It is recommended that you use the default value of true as this
> caching prevents your NameNode from overloading when your cluster has many
> remote HDFS reads.
> {quote}
> Like {{cache_remote_file_handles}}, the flag {{cache_s3_file_handles}} has
> been added as an impalad startup option (the flag is enabled by default).
> Unlike HDFS, though, S3 has no NameNode; the benefit is that caching
> eliminates a call to {{getFileStatus()}} on the target S3 file. So "prevents
> your NameNode from overloading when your cluster has many remote HDFS reads"
> should be changed to something like "avoids an unnecessary call to
> S3AFileSystem#getFileStatus(), which reduces the number of API calls made to
> S3."
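
To make the benefit described above concrete, here is a minimal Python sketch of the general idea behind a file handle cache: repeated reads of the same file reuse a cached handle, so the expensive open step (which for S3 would include the getFileStatus() call) only happens on a cache miss. This is a toy illustration, not Impala's actual implementation. The class name FileHandleCache, the open_fn callable, the s3a:// paths, and the rule that a capacity of 0 disables caching are all assumptions made for the example.

```python
from collections import OrderedDict


class FileHandleCache:
    """Toy LRU cache of open file handles, keyed by path (illustrative only)."""

    def __init__(self, capacity, open_fn):
        self.capacity = capacity      # assumption: 0 means caching is disabled
        self.open_fn = open_fn        # hypothetical callable that opens a path
        self.handles = OrderedDict()  # path -> handle, least recently used first
        self.opens = 0                # number of real opens (cache misses)

    def get(self, path):
        # Cache hit: mark the handle as most recently used and reuse it.
        if path in self.handles:
            self.handles.move_to_end(path)
            return self.handles[path]

        # Cache miss: pay for a real open (for S3, a metadata call as well).
        self.opens += 1
        handle = self.open_fn(path)
        if self.capacity > 0:
            self.handles[path] = handle
            if len(self.handles) > self.capacity:
                # Evict the least recently used handle, closing it if possible.
                _, evicted = self.handles.popitem(last=False)
                close = getattr(evicted, "close", None)
                if close is not None:
                    close()
        return handle


# Usage: four reads touching two distinct files trigger only two real opens.
cache = FileHandleCache(capacity=2, open_fn=lambda p: f"handle:{p}")
for path in ["s3a://bucket/a", "s3a://bucket/a", "s3a://bucket/b", "s3a://bucket/a"]:
    cache.get(path)
print(cache.opens)
```

In the sketch, the opens counter stands in for the number of API calls that caching avoids; the more often the same files are re-read, the larger the saving.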