Sahil Takiar created IMPALA-8490:
------------------------------------

             Summary: Impala Doc: the file handle cache now supports S3
                 Key: IMPALA-8490
                 URL: https://issues.apache.org/jira/browse/IMPALA-8490
             Project: IMPALA
          Issue Type: Sub-task
            Reporter: Sahil Takiar
            Assignee: Alex Rodoni


https://impala.apache.org/docs/build/html/topics/impala_scalability.html states:

{quote}
Because this feature only involves HDFS data files, it does not apply to 
non-HDFS tables, such as Kudu or HBase tables, or tables that store their data 
on cloud services such as S3 or ADLS.
{quote}

This section should be updated because the file handle cache now supports S3 
files.

We should add a section to the docs similar to what we added when support for 
remote HDFS files was added to the file handle cache:

{quote}
In Impala 3.2 and higher, file handle caching also applies to remote HDFS file 
handles. This is controlled by the cache_remote_file_handles flag for an 
impalad. It is recommended that you use the default value of true as this 
caching prevents your NameNode from overloading when your cluster has many 
remote HDFS reads.
{quote}

Like {{cache_remote_file_handles}}, the flag {{cache_s3_file_handles}} has been 
added as an impalad startup option (the flag is enabled by default).

Unlike HDFS, though, S3 has no NameNode. The benefit here is that the cache 
eliminates a call to {{getFileStatus()}} on the target S3 file. So "prevents 
your NameNode from overloading when your cluster has many remote HDFS reads" 
should be changed to something like "avoids an unnecessary call to 
S3AFileSystem#getFileStatus(), which reduces the number of API calls made to 
S3."
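
To illustrate the effect for the doc text, here is a toy model in Python. It is 
not Impala's implementation: the FakeS3 and FileHandleCache classes, the 
(path, mtime) cache key, and the sample path are all assumptions made up for 
this sketch. It only shows why reusing a cached handle avoids repeated 
metadata lookups.

```python
from collections import OrderedDict


class FakeS3:
    """Hypothetical stand-in for a remote store; counts metadata lookups."""

    def __init__(self):
        self.status_calls = 0

    def open(self, path):
        # In this toy model, opening a handle costs one metadata lookup,
        # analogous to the getFileStatus() call described above.
        self.status_calls += 1
        return {"path": path}


class FileHandleCache:
    """Tiny LRU cache of open handles, keyed by (path, mtime) by assumption."""

    def __init__(self, store, capacity=2):
        self.store = store
        self.capacity = capacity
        self.handles = OrderedDict()

    def get(self, path, mtime):
        key = (path, mtime)
        if key in self.handles:
            # Cache hit: reuse the handle, no call to the store.
            self.handles.move_to_end(key)
            return self.handles[key]
        # Cache miss: open the file, which triggers one metadata lookup.
        handle = self.store.open(path)
        self.handles[key] = handle
        if len(self.handles) > self.capacity:
            # Evict the least recently used handle.
            self.handles.popitem(last=False)
        return handle


store = FakeS3()
cache = FileHandleCache(store)

# Five reads of the same file cause only one metadata lookup.
for _ in range(5):
    cache.get("s3a://bucket/t/part-0.parq", mtime=1)
print(store.status_calls)
```

In this model, a changed mtime produces a different key, so a modified file is 
reopened rather than served from a stale handle.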



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

