Sahil Takiar created IMPALA-8490:
------------------------------------
Summary: Impala Doc: the file handle cache now supports S3
Key: IMPALA-8490
URL: https://issues.apache.org/jira/browse/IMPALA-8490
Project: IMPALA
Issue Type: Sub-task
Reporter: Sahil Takiar
Assignee: Alex Rodoni
https://impala.apache.org/docs/build/html/topics/impala_scalability.html states:
{quote}
Because this feature only involves HDFS data files, it does not apply to
non-HDFS tables, such as Kudu or HBase tables, or tables that store their data
on cloud services such as S3 or ADLS.
{quote}
This section should be updated because the file handle cache now supports S3
files.
We should add a section to the docs similar to what we added when support for
remote HDFS files was added to the file handle cache:
{quote}
In Impala 3.2 and higher, file handle caching also applies to remote HDFS file
handles. This is controlled by the cache_remote_file_handles flag for an
impalad. It is recommended that you use the default value of true as this
caching prevents your NameNode from overloading when your cluster has many
remote HDFS reads.
{quote}
Like {{cache_remote_file_handles}}, the flag {{cache_s3_file_handles}} has been
added as an impalad startup option (it is enabled by default).
Unlike HDFS, though, S3 has no NameNode. The benefit of caching S3 file handles
is that it eliminates a call to {{getFileStatus()}} on the target S3 file. So
"prevents your NameNode from overloading when your cluster has many remote HDFS
reads" should be changed to something like "avoids an unnecessary call to
S3AFileSystem#getFileStatus(), which reduces the number of API calls made to S3."
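
For the doc writer, the effect can be illustrated with a toy model. The sketch below is not Impala's implementation; the names FakeS3, HandleCache and count_status_calls are hypothetical, and it assumes that each uncached open costs exactly one status lookup. It only shows why reusing a cached handle reduces the number of metadata calls when the same files are read repeatedly.

```python
from collections import OrderedDict


class FakeS3:
    """Hypothetical stand-in for a remote store that counts status lookups."""

    def __init__(self):
        self.status_calls = 0

    def open(self, path):
        # Assumption: opening a file costs one getFileStatus-style lookup.
        self.status_calls += 1
        return f"handle:{path}"


class HandleCache:
    """Tiny LRU cache of open handles, keyed by file path."""

    def __init__(self, store, capacity, enabled=True):
        self.store = store
        self.capacity = capacity
        self.enabled = enabled
        self._handles = OrderedDict()

    def get(self, path):
        if self.enabled and path in self._handles:
            # Cache hit: reuse the handle, no lookup against the store.
            self._handles.move_to_end(path)
            return self._handles[path]
        handle = self.store.open(path)
        if self.enabled:
            self._handles[path] = handle
            if len(self._handles) > self.capacity:
                # Evict the least recently used handle.
                self._handles.popitem(last=False)
        return handle


def count_status_calls(enabled, reads):
    """Replay a sequence of reads and return how many lookups the store saw."""
    store = FakeS3()
    cache = HandleCache(store, capacity=8, enabled=enabled)
    for path in reads:
        cache.get(path)
    return store.status_calls


if __name__ == "__main__":
    reads = ["a.parq", "b.parq"] * 3
    print("cache enabled: ", count_status_calls(True, reads))
    print("cache disabled:", count_status_calls(False, reads))
```

In this model, six reads over two files cost two lookups with the cache enabled and six with it disabled. Real savings depend on the workload and on cache capacity and eviction, which the sketch does not try to match.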