This is an automated email from the ASF dual-hosted git repository.

michaelsmith pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 7dcf80b32e207c1078bed7aca1714ce59d2afe13
Author: Shajini Thayasingh <[email protected]>
AuthorDate: Fri Feb 3 10:40:43 2023 -0800

    IMPALA-10804: [DOCS] Document spill to remote storage
    
    Spill to HDFS, S3, and Ozone.
    
    Change-Id: I3efb2ffcc06cdbe69845c6dc4cf03d9f2e3dcabc
    Reviewed-on: http://gerrit.cloudera.org:8080/19472
    Reviewed-by: Yida Wu <[email protected]>
    Tested-by: Impala Public Jenkins <[email protected]>
---
 docs/topics/impala_disk_space.xml | 106 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)

diff --git a/docs/topics/impala_disk_space.xml b/docs/topics/impala_disk_space.xml
index b32502ff1..4440f7763 100644
--- a/docs/topics/impala_disk_space.xml
+++ b/docs/topics/impala_disk_space.xml
@@ -343,6 +343,112 @@ under the License.
       <p> Compression levels from 1 up to 22 (default 3) are supported for 
<codeph>ZSTD</codeph>.
         The lower the compression level, the faster the speed at the cost of 
compression ratio.</p>
     </section>
+    <section>
+      <title>Configure Impala Daemon to spill to S3</title>
+      <p>Impala occasionally needs to use persistent storage for writing intermediate files during
+        large sorts, joins, aggregations, or analytic function operations. If your workload
+        writes large volumes of intermediate data, it is recommended to configure
+        heavily spilling queries to use a remote storage location rather than the local one. The
+        advantage of using remote storage for scratch space is that it is elastic and can handle any
+        amount of spilling.</p>
+      <p><b>Before you begin</b></p>
+      <p>Identify the URL of the S3 bucket to which you want Impala to write the temporary
+        data. If you use the S3 bucket that is associated with the environment, navigate to the S3
+        bucket and copy the URL. If you want to use an external S3 bucket, you must first configure
+        your environment to use the external S3 bucket with the correct read/write permissions.</p>
+      <p><b>Configuring the Start-up Option in Impala daemon</b></p>
+      <p>You can use the Impalad start option <codeph>scratch_dirs</codeph> to specify the locations of the
+        intermediate files. The format of the option is <codeph>scratch_dirs=remote_dir, local_buffer_dir(,
+          local_dir…)</codeph>.</p>
+      <p>With the option specified above:</p>
+      <ul>
+        <li>You can specify only one remote directory. When you configure a remote directory, you
+          must also specify a local buffer directory. However, you can use multiple local
+          directories with the remote directory. If you specify multiple local directories, the
+          first local directory is used as the local buffer directory.</li>
+        <li>If you configure both remote and local directories, the remote directory is only used
+          when the local directories are fully utilized.</li>
+        <li>The size of a remote intermediate file can affect query performance. You can set the
+          size with the <codeph>remote_tmp_file_size</codeph> start-up option. The default
+          size of a remote intermediate file is 16MB while the maximum is 256MB.</li>
+      </ul>
+      <p><b>Examples</b></p>
+      <ul>
+        <li>A remote scratch dir with one local buffer dir, file size 64MB.
+          <codeblock>‑‑scratch_dirs="s3a://remote_dir, /local_buffer_dir" ‑‑remote_tmp_file_size=64M</codeblock></li>
+        <li>A remote scratch dir with one local buffer dir, and one local dir.
+          <codeblock>‑‑scratch_dirs="s3a://remote_dir, /local_buffer_dir, /local_dir"</codeblock></li>
+        <li>A remote scratch dir with one local buffer dir, and multiple local dirs.
+          <codeblock>‑‑scratch_dirs="s3a://remote_dir, /local_buffer_dir, /local_dir_1, /local_dir_2"</codeblock></li>
+      </ul>
+    </section>
+    <section>
+      <title>Configure Impala Daemon to spill to HDFS</title>
+      <p>Impala occasionally needs to use persistent storage for writing intermediate files during
+        large sorts, joins, aggregations, or analytic function operations. If your workload
+        writes large volumes of intermediate data, it is recommended to configure
+        heavily spilling queries to use a remote storage location rather than the local one. The
+        advantage of using remote storage for scratch space is that it is elastic and can handle any
+        amount of spilling.</p>
+      <p><b>Before you begin</b></p>
+      <ul>
+        <li>Identify the HDFS scratch directory where you want Impala to write the
+          temporary data.</li>
+        <li>Identify the port number of the HDFS scratch directory.</li>
+        <li>Configure Impala to write temporary data to disk during query processing.</li>
+      <p><b>Configuring the Start-up Option in Impala daemon</b></p>
+      <p>You can use the Impalad start option <codeph>scratch_dirs</codeph> to specify the locations of the
+        intermediate files.</p>
+      <p>Use the following format for this start up option:</p>
+      <codeblock>‑‑scratch_dirs="hdfs://ip_address:port_num/path(:max_bytes)(:priority), /local_buffer_dir" ‑‑remote_tmp_file_size=xM</codeblock>
+      <ul>
+        <li>Where <codeph>hdfs://ip_address:port_num/path(:max_bytes)(:priority)</codeph> is the remote
+          directory.</li>
+        <li><codeph>port_num</codeph> is required for the HDFS scratch directory.</li>
+        <li><codeph>max_bytes</codeph> and <codeph>priority</codeph> are optional.</li>
+      </ul>
+      <p>Using the above format:</p>
+      <ul>
+        <li>You can specify only one remote directory.</li>
+        <li>When you configure a remote directory, you must also specify a local buffer
+          directory. However, you can use multiple local directories with the remote directory. If you
+          specify multiple local directories, the first local directory is used as the local
+          buffer directory.</li>
+        <li>If you configure both remote and local directories, the remote directory is only used
+          when the local directories are fully utilized.</li>
+        <li>The size of a remote intermediate file can affect query performance. You can set the
+          size with the <codeph>remote_tmp_file_size</codeph> start-up option. The default size of a remote
+          intermediate file is 16MB while the maximum is 512MB.</li>
+      </ul>
+      <p><b>Examples</b></p>
+      <ul>
+        <li>An HDFS scratch dir with one local buffer dir, file size 64MB. The space of the HDFS scratch
+          dir is limited to 300G.
+          <codeblock>‑‑scratch_dirs="hdfs://ip_address:port_num/path:300G, /local_buffer_dir" ‑‑remote_tmp_file_size=64M</codeblock></li>
+        <li>An HDFS scratch dir with one local buffer dir, and one local dir. The space of the HDFS
+          scratch dir is limited to 300G.
+          <codeblock>‑‑scratch_dirs="hdfs://ip_address:port_num/path:300G, /local_buffer_dir, /local_dir"</codeblock></li>
+        <li>An HDFS scratch dir with one local buffer dir, and multiple local dirs. The space of the HDFS
+          scratch dir is unlimited.
+          <codeblock>‑‑scratch_dirs="hdfs://ip_address:port_num/path, /local_buffer_dir, /local_dir_1, /local_dir_2"</codeblock></li>
+      </ul>
+      <p>Although <codeph>max_bytes</codeph> is optional, setting it is highly recommended when spilling to
+        HDFS because the HDFS cluster space is limited.</p>
+    </section>
+    <section>
+      <title>Configure Impala Daemon to spill to Ozone</title>
+      <p><b>Before you begin</b></p>
+      <ul>
+        <li>Identify the Ozone scratch directory where you want Impala to write the
+          temporary data.</li>
+        <li>Identify the port number of the Ozone scratch directory.</li>
+      </ul>
+      <p><b>Configuring the Start-up Option in Impala daemon</b></p>
+      <p>You can use the Impalad start option <codeph>scratch_dirs</codeph> to specify the locations of the
+        intermediate files.</p>
+      <codeblock>‑‑scratch_dirs="ofs://ip_address:port_num(:max_bytes)(:priority), /local_buffer_dir" ‑‑remote_tmp_file_size=xM</codeblock>
+    </section>
   </conbody>
 
 </concept>
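
As a reading aid for the scratch_dirs rules documented in the patch above (at most one remote directory, a required local buffer directory, and the first local directory acting as that buffer), here is a minimal Python sketch. It is not Impala's own parser: the helper name split_scratch_dirs and the scheme list are assumptions made for illustration, taken only from the s3a://, hdfs://, and ofs:// prefixes used in the documented examples. It does not interpret the optional max_bytes or priority suffixes and does not enforce entry order.

```python
# Schemes that appear in the patch's examples; treating exactly these as
# "remote" is an assumption of this sketch, not Impala's definition.
REMOTE_SCHEMES = ("s3a://", "hdfs://", "ofs://")


def split_scratch_dirs(value):
    """Classify a comma-separated scratch_dirs value per the documented rules.

    Hypothetical helper for illustration only.
    """
    # Split on commas and drop surrounding whitespace and empty entries.
    entries = [e.strip() for e in value.split(",") if e.strip()]
    remote = [e for e in entries if e.startswith(REMOTE_SCHEMES)]
    local = [e for e in entries if not e.startswith(REMOTE_SCHEMES)]

    # Rule: only one remote directory may be specified.
    if len(remote) > 1:
        raise ValueError("only one remote directory may be specified")
    # Rule: a remote directory requires a local buffer directory.
    if remote and not local:
        raise ValueError("a remote directory requires a local buffer directory")

    if not remote:
        # Local-only configuration: no buffer role is needed.
        return {"remote": None, "local_buffer": None, "local": local}
    # Rule: the first local directory is used as the local buffer directory.
    return {"remote": remote[0], "local_buffer": local[0], "local": local[1:]}
```

For the third S3 example in the patch, split_scratch_dirs("s3a://remote_dir, /local_buffer_dir, /local_dir_1, /local_dir_2") classifies /local_buffer_dir as the buffer and the remaining two paths as ordinary local scratch directories.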
