http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_breakpad.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_breakpad.html b/docs/build/html/topics/impala_breakpad.html new file mode 100644 index 0000000..7e05497 --- /dev/null +++ b/docs/build/html/topics/impala_breakpad.html @@ -0,0 +1,223 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_troubleshooting.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="breakpad"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Breakpad Minidumps for Impala (Impala 2.6 or higher only)</title></head><body id="breakpad"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Breakpad Minidumps for Impala (<span class="keyword">Impala 2.6</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + The <a class="xref" href="https://chromium.googlesource.com/breakpad/breakpad/" target="_blank">breakpad</a> + project is an open-source framework for crash reporting. + In <span class="keyword">Impala 2.6</span> and higher, Impala can use <code class="ph codeph">breakpad</code> to record stack information and + register values when any of the Impala-related daemons crash due to an error such as <code class="ph codeph">SIGSEGV</code> + or unhandled exceptions. 
+ The dump files are much smaller than traditional core dump files. The dump mechanism itself uses very little + memory, which improves reliability if the crash occurs while the system is low on memory. + </p> + + <div class="note important note_important"><span class="note__title importanttitle">Important:</span> + Because of the internal mechanisms involving Impala memory allocation and Linux + signalling for out-of-memory (OOM) errors, if an Impala-related daemon experiences a + crash due to an OOM condition, it does <em class="ph i">not</em> generate a minidump for that error. + <p class="p"> + + </p> + </div> + + + <p class="p toc inpage"></p> + + </div> + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_troubleshooting.html">Troubleshooting Impala</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="breakpad__breakpad_minidump_enable"> + <h2 class="title topictitle2" id="ariaid-title2">Enabling or Disabling Minidump Generation</h2> + <div class="body conbody"> + <p class="p"> + By default, a minidump file is generated when an Impala-related daemon crashes. + To turn off generation of the minidump files, change the + <span class="ph uicontrol">minidump_path</span> configuration setting of one or more Impala-related daemons + to the empty string, and restart the corresponding services or daemons. + </p> + + <p class="p"> + In <span class="keyword">Impala 2.7</span> and higher, + you can send a <code class="ph codeph">SIGUSR1</code> signal to any Impala-related daemon to write a + Breakpad minidump. For advanced troubleshooting, you can now produce a minidump + without triggering a crash. 
+ </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="breakpad__breakpad_minidump_location"> + <h2 class="title topictitle2" id="ariaid-title3">Specifying the Location for Minidump Files</h2> + <div class="body conbody"> + <div class="p"> + By default, all minidump files are written to the following location + on the host where a crash occurs: + + <ul class="ul"> + <li class="li"> + <p class="p"> + Clusters not managed by cluster management software: + <span class="ph filepath"><var class="keyword varname">impala_log_dir</var>/<var class="keyword varname">daemon_name</var>/minidumps/<var class="keyword varname">daemon_name</var></span> + </p> + </li> + </ul> + The minidump files for <span class="keyword cmdname">impalad</span>, <span class="keyword cmdname">catalogd</span>, + and <span class="keyword cmdname">statestored</span> are each written to a separate directory. + </div> + <p class="p"> + To specify a different location, set the + + <span class="ph uicontrol">minidump_path</span> + configuration setting of one or more Impala-related daemons, and restart the corresponding services or daemons. + </p> + <p class="p"> + If you specify a relative path for this setting, the value is interpreted relative to + the default <span class="ph uicontrol">minidump_path</span> directory. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="breakpad__breakpad_minidump_number"> + <h2 class="title topictitle2" id="ariaid-title4">Controlling the Number of Minidump Files</h2> + <div class="body conbody"> + <p class="p"> + Like any files used for logging or troubleshooting, consider limiting the number of + minidump files, or removing unneeded ones, depending on the amount of free storage + space on the hosts in the cluster. 
+ </p> + <p class="p"> + Because the minidump files are only used for problem resolution, you can remove any such files that + are not needed to debug current issues. + </p> + <p class="p"> + To control how many minidump files Impala keeps around at any one time, + set the <span class="ph uicontrol">max_minidumps</span> configuration setting + of one or more Impala-related daemons, and restart the corresponding services or daemons. + The default for this setting is 9. A zero or negative value is interpreted as + <span class="q">"unlimited"</span>. + </p> + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title5" id="breakpad__breakpad_minidump_logging"> + <h2 class="title topictitle2" id="ariaid-title5">Detecting Crash Events</h2> + <div class="body conbody"> + + <p class="p"> + The Impala log files record each crash event that generates a + minidump file. Because each restart begins a new log file, the <span class="q">"crashed"</span> message + is always at or near the bottom of the log file. There might be another later message + if core dumps are also enabled. + </p> + + </div> + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="breakpad__breakpad_demo"> + <h2 class="title topictitle2" id="ariaid-title6">Demonstration of Breakpad Feature</h2> + <div class="body conbody"> + <p class="p"> + The following example uses the command <span class="keyword cmdname">kill -11</span> to + simulate a <code class="ph codeph">SIGSEGV</code> crash for an <span class="keyword cmdname">impalad</span> + process on a single DataNode, then examines the relevant log files and minidump file. + </p> + + <p class="p"> + First, as root on a worker node, kill the <span class="keyword cmdname">impalad</span> process with a + <code class="ph codeph">SIGSEGV</code> error. The original process ID was 23114. + </p> + +<pre class="pre codeblock"><code> +# ps ax | grep impalad +23114 ? 
Sl 0:18 /opt/local/parcels/<parcel_version>/lib/impala/sbin/impalad --flagfile=/var/run/impala/process/114-impala-IMPALAD/impala-conf/impalad_flags +31259 pts/0 S+ 0:00 grep impalad +# +# kill -11 23114 +# +# ps ax | grep impalad +31374 ? Rl 0:04 /opt/local/parcels/<parcel_version>/lib/impala/sbin/impalad --flagfile=/var/run/impala/process/114-impala-IMPALAD/impala-conf/impalad_flags +31475 pts/0 S+ 0:00 grep impalad + +</code></pre> + + <p class="p"> + We locate the log directory underneath <span class="ph filepath">/var/log</span>. + There is a <code class="ph codeph">.INFO</code>, <code class="ph codeph">.WARNING</code>, and <code class="ph codeph">.ERROR</code> + log file for the 23114 process ID. The minidump message is written to the + <code class="ph codeph">.INFO</code> file and the <code class="ph codeph">.ERROR</code> file, but not the + <code class="ph codeph">.WARNING</code> file. In this case, a large core file was also produced. + </p> +<pre class="pre codeblock"><code> +# cd /var/log/impalad +# ls -la | grep 23114 +-rw------- 1 impala impala 3539079168 Jun 23 15:20 core.23114 +-rw-r--r-- 1 impala impala 99057 Jun 23 15:20 hs_err_pid23114.log +-rw-r--r-- 1 impala impala 351 Jun 23 15:20 impalad.worker_node_123.impala.log.ERROR.20160623-140343.23114 +-rw-r--r-- 1 impala impala 29101 Jun 23 15:20 impalad.worker_node_123.impala.log.INFO.20160623-140343.23114 +-rw-r--r-- 1 impala impala 228 Jun 23 14:03 impalad.worker_node_123.impala.log.WARNING.20160623-140343.23114 + +</code></pre> + <p class="p"> + The <code class="ph codeph">.INFO</code> log includes the location of the minidump file, followed by + a report of a core dump. With the breakpad minidump feature enabled, now we might + disable core dumps or keep fewer of them around. + </p> +<pre class="pre codeblock"><code> +# cat impalad.worker_node_123.impala.log.INFO.20160623-140343.23114 +... 
+Wrote minidump to /var/log/impala-minidumps/impalad/0980da2d-a905-01e1-25ff883a-04ee027a.dmp +# +# A fatal error has been detected by the Java Runtime Environment: +# +# SIGSEGV (0xb) at pc=0x00000030c0e0b68a, pid=23114, tid=139869541455968 +# +# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01) +# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops) +# Problematic frame: +# C [libpthread.so.0+0xb68a] pthread_cond_wait+0xca +# +# Core dump written. Default location: /var/log/impalad/core or core.23114 +# +# An error report file with more information is saved as: +# /var/log/impalad/hs_err_pid23114.log +# +# If you would like to submit a bug report, please visit: +# http://bugreport.sun.com/bugreport/crash.jsp +# The crash happened outside the Java Virtual Machine in native code. +# See problematic frame for where to report the bug. +... + +# cat impalad.worker_node_123.impala.log.ERROR.20160623-140343.23114 + +Log file created at: 2016/06/23 14:03:43 +Running on machine:.worker_node_123 +Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg +E0623 14:03:43.911002 23114 logging.cc:118] stderr will be logged to this file. +Wrote minidump to /var/log/impala-minidumps/impalad/0980da2d-a905-01e1-25ff883a-04ee027a.dmp + +</code></pre> + + <p class="p"> + The resulting minidump file is much smaller than the corresponding core file, + making it much easier to supply diagnostic information to <span class="keyword">the appropriate support channel</span>. + </p> + +<pre class="pre codeblock"><code> +# pwd +/var/log/impalad +# cd ../impala-minidumps/impalad +# ls +0980da2d-a905-01e1-25ff883a-04ee027a.dmp +# du -kh * +2.4M 0980da2d-a905-01e1-25ff883a-04ee027a.dmp + +</code></pre> + </div> + </article> + +</article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_char.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_char.html b/docs/build/html/topics/impala_char.html new file mode 100644 index 0000000..e0b4cb9 --- /dev/null +++ b/docs/build/html/topics/impala_char.html @@ -0,0 +1,305 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_datatypes.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="char"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>CHAR Data Type (Impala 2.0 or higher only)</title></head><body id="char"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">CHAR Data Type (<span class="keyword">Impala 2.0</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + + A fixed-length character type, padded with trailing spaces if necessary to achieve the specified length. If + values are longer than the specified length, Impala truncates any trailing characters. 
+ </p> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + + <p class="p"> + In the column definition of a <code class="ph codeph">CREATE TABLE</code> statement: + </p> + +<pre class="pre codeblock"><code><var class="keyword varname">column_name</var> CHAR(<var class="keyword varname">length</var>)</code></pre> + + <p class="p"> + The maximum length you can specify is 255. + </p> + + <p class="p"> + <strong class="ph b">Semantics of trailing spaces:</strong> + </p> + + <ul class="ul"> + <li class="li"> + When you store a <code class="ph codeph">CHAR</code> value shorter than the specified length in a table, queries return + the value padded with trailing spaces if necessary; the resulting value has the same length as specified in + the column definition. + </li> + + <li class="li"> + If you store a <code class="ph codeph">CHAR</code> value containing trailing spaces in a table, those trailing spaces are + not stored in the data file. When the value is retrieved by a query, the result could have a different + number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the + specified length of the column. + </li> + + <li class="li"> + If you compare two <code class="ph codeph">CHAR</code> values that differ only in the number of trailing spaces, those + values are considered identical. + </li> + </ul> + + <p class="p"> + <strong class="ph b">Partitioning:</strong> This type can be used for partition key columns. Because of the efficiency advantage + of numeric values over character-based values, if the partition key is a string representation of a number, + prefer to use an integer type with sufficient range (<code class="ph codeph">INT</code>, <code class="ph codeph">BIGINT</code>, and so + on) where practical. + </p> + + <p class="p"> + <strong class="ph b">HBase considerations:</strong> This data type cannot be used with HBase tables. 
+ </p> + + <p class="p"> + <strong class="ph b">Parquet considerations:</strong> + </p> + + <ul class="ul"> + <li class="li"> + This type can be read from and written to Parquet files. + </li> + + <li class="li"> + There is no requirement for a particular level of Parquet. + </li> + + <li class="li"> + Parquet files generated by Impala and containing this type can be freely interchanged with other components + such as Hive and MapReduce. + </li> + + <li class="li"> + Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files. + </li> + + <li class="li"> + Parquet data files might contain values that are longer than allowed by the + <code class="ph codeph">CHAR(<var class="keyword varname">n</var>)</code> length limit. Impala ignores any extra trailing characters when + it processes those values during a query. + </li> + </ul> + + <p class="p"> + <strong class="ph b">Text table considerations:</strong> + </p> + + <p class="p"> + Text data files might contain values that are longer than allowed for a particular + <code class="ph codeph">CHAR(<var class="keyword varname">n</var>)</code> column. Any extra trailing characters are ignored when Impala + processes those values during a query. Text data files can also contain values that are shorter than the + defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data + files produced by Impala <code class="ph codeph">INSERT</code> statements do not include any trailing blanks for + <code class="ph codeph">CHAR</code> columns. + </p> + + <p class="p"><strong class="ph b">Avro considerations:</strong></p> + <p class="p"> + The Avro specification allows string values up to 2**64 bytes in length. + Impala queries for Avro tables use 32-bit integers to hold string lengths. 
+ In <span class="keyword">Impala 2.5</span> and higher, Impala truncates <code class="ph codeph">CHAR</code> + and <code class="ph codeph">VARCHAR</code> values in Avro tables to (2**31)-1 bytes. + If a query encounters a <code class="ph codeph">STRING</code> value longer than (2**31)-1 + bytes in an Avro table, the query fails. In earlier releases, + encountering such long values in an Avro table could cause a crash. + </p> + + <p class="p"> + <strong class="ph b">Compatibility:</strong> + </p> + + <p class="p"> + This type is available using <span class="keyword">Impala 2.0</span> or higher. + </p> + + <p class="p"> + Some other database systems make the length specification optional. For Impala, the length is required. + </p> + + + + <p class="p"> + <strong class="ph b">Internal details:</strong> Represented in memory as a byte array with the same size as the length + specification. Values that are shorter than the specified length are padded on the right with trailing + spaces. + </p> + + <p class="p"> + <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.0.0</span> + </p> + + <p class="p"> + <strong class="ph b">Column statistics considerations:</strong> Because this type has a fixed size, the maximum and average size + fields are always filled in for column statistics, even before you run the <code class="ph codeph">COMPUTE STATS</code> + statement. + </p> + + + + <p class="p"> + <strong class="ph b">UDF considerations:</strong> This type cannot be used for the argument or return type of a user-defined + function (UDF) or user-defined aggregate function (UDA). + </p> + + <p class="p"> + <strong class="ph b">Examples:</strong> + </p> + + <p class="p"> + These examples show how trailing spaces are not considered significant when comparing or processing + <code class="ph codeph">CHAR</code> values. <code class="ph codeph">CAST()</code> truncates any longer string to fit within the defined + length. 
If a <code class="ph codeph">CHAR</code> value is shorter than the specified length, it is padded on the right with + spaces until it matches the specified length. Therefore, <code class="ph codeph">LENGTH()</code> represents the length + including any trailing spaces, and <code class="ph codeph">CONCAT()</code> also treats the column value as if it has + trailing spaces. + </p> + +<pre class="pre codeblock"><code>select cast('x' as char(4)) = cast('x ' as char(4)) as "unpadded equal to padded"; ++--------------------------+ +| unpadded equal to padded | ++--------------------------+ +| true | ++--------------------------+ + +create table char_length(c char(3)); +insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3))); +select concat("[",c,"]") as c, length(c) from char_length; ++-------+-----------+ +| c | length(c) | ++-------+-----------+ +| [1 ] | 3 | +| [12 ] | 3 | +| [123] | 3 | +| [123] | 3 | ++-------+-----------+ +</code></pre> + + <p class="p"> + This example shows a case where data values are known to have a specific length, where <code class="ph codeph">CHAR</code> + is a logical data type to use. + + </p> + +<pre class="pre codeblock"><code>create table addresses + (id bigint, + street_name string, + state_abbreviation char(2), + country_abbreviation char(2)); +</code></pre> + + <p class="p"> + The following example shows how values written by Impala do not physically include the trailing spaces. It + creates a table using text format, with <code class="ph codeph">CHAR</code> values much shorter than the declared length, + and then prints the resulting data file to show that the delimited values are not separated by spaces. The + same behavior applies to binary-format Parquet data files. 
+ </p> + +<pre class="pre codeblock"><code>create table char_in_text (a char(20), b char(30), c char(40)) + row format delimited fields terminated by ','; + +insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40))); + +-- Running this Linux command inside impala-shell using the ! shortcut. +!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*'; +foo,bar,baz +hello,goodbye,aloha +</code></pre> + + <p class="p"> + The following example further illustrates the treatment of spaces. It replaces the contents of the previous + table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved + within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the + leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala. 
+ </p> + +<pre class="pre codeblock"><code>insert overwrite char_in_text values (cast('trailing ' as char(20)), cast(' leading and trailing ' as char(30)), cast(' leading' as char(40))); +!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*'; +trailing, leading and trailing, leading + +select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text; ++------------------------+----------------------------------+--------------------------------------------+ +| a | b | c | ++------------------------+----------------------------------+--------------------------------------------+ +| [trailing ] | [ leading and trailing ] | [ leading ] | ++------------------------+----------------------------------+--------------------------------------------+ +</code></pre> + + <p class="p"> + <strong class="ph b">Kudu considerations:</strong> + </p> + <p class="p"> + Currently, the data types <code class="ph codeph">DECIMAL</code>, <code class="ph codeph">TIMESTAMP</code>, <code class="ph codeph">CHAR</code>, <code class="ph codeph">VARCHAR</code>, + <code class="ph codeph">ARRAY</code>, <code class="ph codeph">MAP</code>, and <code class="ph codeph">STRUCT</code> cannot be used with Kudu tables. + </p> + + <p class="p"> + <strong class="ph b">Restrictions:</strong> + </p> + + <p class="p"> + Because the blank-padding behavior requires allocating the maximum length for each value in memory, for + scalability reasons avoid declaring <code class="ph codeph">CHAR</code> columns that are much longer than typical values in + that column. + </p> + + <p class="p"> + All data in <code class="ph codeph">CHAR</code> and <code class="ph codeph">VARCHAR</code> columns must be in a character encoding that + is compatible with UTF-8. If you have binary data from another database system (that is, a BLOB type), use + a <code class="ph codeph">STRING</code> column to hold it. 
+ </p> + + <p class="p"> + When an expression compares a <code class="ph codeph">CHAR</code> with a <code class="ph codeph">STRING</code> or + <code class="ph codeph">VARCHAR</code>, the <code class="ph codeph">CHAR</code> value is implicitly converted to <code class="ph codeph">STRING</code> + first, with trailing spaces preserved. + </p> + +<pre class="pre codeblock"><code>select cast("foo " as char(5)) = 'foo' as "char equal to string"; ++----------------------+ +| char equal to string | ++----------------------+ +| false | ++----------------------+ +</code></pre> + + <p class="p"> + This behavior differs from other popular database systems. To get the expected result of + <code class="ph codeph">TRUE</code>, cast the expressions on both sides to <code class="ph codeph">CHAR</code> values of the appropriate + length: + </p> + +<pre class="pre codeblock"><code>select cast("foo " as char(5)) = cast('foo' as char(3)) as "char equal to string"; ++----------------------+ +| char equal to string | ++----------------------+ +| true | ++----------------------+ +</code></pre> + + <p class="p"> + This behavior is subject to change in future releases. 
+ </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + + <p class="p"> + <a class="xref" href="impala_string.html#string">STRING Data Type</a>, <a class="xref" href="impala_varchar.html#varchar">VARCHAR Data Type (Impala 2.0 or higher only)</a>, + <a class="xref" href="impala_literals.html#string_literals">String Literals</a>, + <a class="xref" href="impala_string_functions.html#string_functions">Impala String Functions</a> + </p> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_datatypes.html">Data Types</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_cluster_sizing.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_cluster_sizing.html b/docs/build/html/topics/impala_cluster_sizing.html new file mode 100644 index 0000000..d1f2a51 --- /dev/null +++ b/docs/build/html/topics/impala_cluster_sizing.html @@ -0,0 +1,318 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_planning.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="cluster_sizing"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Cluster Sizing Guidelines for Impala</title></head><body id="cluster_sizing"><main 
role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Cluster Sizing Guidelines for Impala</h1> + + + + <div class="body conbody"> + + <p class="p"> + + This document provides a very rough guideline to estimate the size of a cluster needed for a specific + customer application. You can use this information when planning how much and what type of hardware to + acquire for a new cluster, or when adding Impala workloads to an existing cluster. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + Before making purchase or deployment decisions, consult organizations with relevant experience + to verify the conclusions about hardware requirements based on your data volume and workload. + </div> + + + + <p class="p"> + Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently, + Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration. + Because work can be distributed in different ways for different queries, if some hosts are overloaded + compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance + and overall slowness. + </p> + + <p class="p"> + For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB RAM, + 12 x 2 TB hard drives, 2x E5-2630L with 12 cores total, 10 GB network), the following table estimates the number of + DataNodes needed in the cluster based on data size and the number of concurrent queries, for workloads + similar to TPC-DS benchmark queries: + </p> + + <table class="table"><caption><span class="table--title-label">Table 1. 
</span><span class="title">Cluster size estimation based on the number of concurrent queries and data size with a 20 second average query response time</span></caption><colgroup><col><col><col><col><col><col></colgroup><thead class="thead"> + <tr class="row"> + <th class="entry nocellnorowborder" id="cluster_sizing__entry__1"> + Data Size + </th> + <th class="entry nocellnorowborder" id="cluster_sizing__entry__2"> + 1 query + </th> + <th class="entry nocellnorowborder" id="cluster_sizing__entry__3"> + 10 queries + </th> + <th class="entry nocellnorowborder" id="cluster_sizing__entry__4"> + 100 queries + </th> + <th class="entry nocellnorowborder" id="cluster_sizing__entry__5"> + 1000 queries + </th> + <th class="entry nocellnorowborder" id="cluster_sizing__entry__6"> + 2000 queries + </th> + </tr> + </thead><tbody class="tbody"> + <tr class="row"> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__1 "> + <strong class="ph b">250 GB</strong> + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__2 "> + 2 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__3 "> + 2 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__4 "> + 5 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__5 "> + 35 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__6 "> + 70 + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__1 "> + <strong class="ph b">500 GB</strong> + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__2 "> + 2 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__3 "> + 2 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__4 "> + 10 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__5 "> + 70 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__6 "> + 135 
+ </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__1 "> + <strong class="ph b">1 TB</strong> + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__2 "> + 2 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__3 "> + 2 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__4 "> + 15 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__5 "> + 135 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__6 "> + 270 + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__1 "> + <strong class="ph b">15 TB</strong> + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__2 "> + 2 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__3 "> + 20 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__4 "> + 200 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__5 "> + N/A + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__6 "> + N/A + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__1 "> + <strong class="ph b">30 TB</strong> + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__2 "> + 4 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__3 "> + 40 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__4 "> + 400 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__5 "> + N/A + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__6 "> + N/A + </td> + </tr> + <tr class="row"> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__1 "> + <strong class="ph b">60 TB</strong> + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__2 "> + 
8 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__3 "> + 80 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__4 "> + 800 + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__5 "> + N/A + </td> + <td class="entry nocellnorowborder" headers="cluster_sizing__entry__6 "> + N/A + </td> + </tr> + </tbody></table> + + <section class="section" id="cluster_sizing__sizing_factors"><h2 class="title sectiontitle">Factors Affecting Scalability</h2> + + + + <p class="p"> + A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each + node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with + cluster size. However, for some workloads, the scalability might be bounded by the network, or even by + memory. + </p> + + <p class="p"> + If the workload is already network bound (on a 10 GB network), increasing the cluster size won't reduce + the network load; in fact, a larger cluster could increase network traffic because some queries involve + <span class="q">"broadcast"</span> operations to all DataNodes. Therefore, boosting the cluster size does not improve query + throughput in a network-constrained environment. + </p> + + <p class="p"> + Let's look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional + concurrent queries because all memory allocated has already been consumed, but neither CPU, disk, nor + network is saturated yet. This can happen because currently Impala uses only a single core per node to + process join and aggregation queries. For a node with 128 GB of RAM, if a join node takes 50 GB, the system + cannot run more than 2 such queries at the same time. + </p> + + <p class="p"> + Therefore, at most 2 cores are used. Throughput can still scale almost linearly even for a memory-bound + workload. It's just that the CPU will not be saturated. 
Per-node throughput will be lower than 1.6 + GB/sec. Consider increasing the memory per node. + </p> + + <p class="p"> + As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the + throughput estimate. + </p> + </section> + + <section class="section" id="cluster_sizing__sizing_details"><h2 class="title sectiontitle">A More Precise Approach</h2> + + + + <p class="p"> + A more precise sizing estimate would require not only queries per minute (QPM), but also an average data + size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total + data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed: + </p> + +<pre class="pre codeblock"><code>Eq 1: N > QPM * D / 100 GB +</code></pre> + + <p class="p"> + Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is + required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100/15*60 = 400. We can + estimate the number of nodes using our equation above. + </p> + +<pre class="pre codeblock"><code>N > QPM * D / 100GB +N > 400 * 50GB / 100GB +N > 200 +</code></pre> + + <p class="p"> + Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500. + </p> + + <p class="p"> + Depending on the complexity of the query, the processing rate of a query might change. If the query has more + joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the + processing rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scan and + filtering on numbers, the processing rate can be higher. + </p> + </section> + + <section class="section" id="cluster_sizing__sizing_mem_estimate"><h2 class="title sectiontitle">Estimating Memory Requirements</h2> + + + + + <p class="p"> + Impala can handle joins between multiple large tables. 
Make sure that statistics are collected for all the + joined tables, using the <code class="ph codeph"><a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE + STATS</a></code> statement. However, joining big tables does consume more memory. Follow the steps + below to calculate the minimum memory requirement. + </p> + + <p class="p"> + Suppose you are running the following join: + </p> + +<pre class="pre codeblock"><code>select a.*, b.col_1, b.col_2, … b.col_n +from a, b +where a.key = b.key +and b.col_1 in (1,2,4...) +and b.col_4 in (....); +</code></pre> + + <p class="p"> + And suppose table <code class="ph codeph">B</code> is smaller than table <code class="ph codeph">A</code> (but still a large table). + </p> + + <p class="p"> + The memory requirement for the query is that the right-hand table (<code class="ph codeph">B</code>), after decompression, + filtering (<code class="ph codeph">b.col_n in ...</code>), and projection (only using certain columns), must be smaller + than the total memory of the entire cluster. + </p> + +<pre class="pre codeblock"><code>Cluster Total Memory Requirement = Size of the smaller table * + selectivity factor from the predicate * + projection factor * compression ratio +</code></pre> + + <p class="p"> + In this case, assume that table <code class="ph codeph">B</code> is 100 TB in Parquet format with 200 columns. The + predicate on <code class="ph codeph">B</code> (<code class="ph codeph">b.col_1 in ...and b.col_4 in ...</code>) will select only 10% of + the rows from <code class="ph codeph">B</code> and for projection, we are only projecting 5 columns out of 200 columns. + Usually, Snappy compression gives us 3 times compression, so we estimate a 3x compression factor. 
+ </p> + +<pre class="pre codeblock"><code>Cluster Total Memory Requirement = Size of the smaller table * + selectivity factor from the predicate * + projection factor * compression ratio + = 100TB * 10% * 5/200 * 3 + = 0.75TB + = 750GB +</code></pre> + + <p class="p"> + So, if you have a 10-node cluster where each node has 128 GB of RAM and you give 80% to Impala, then you have about 1 + TB of usable memory for Impala, which is more than 750 GB. Therefore, your cluster can handle join queries + of this magnitude. + </p> + </section> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_planning.html">Planning for Impala Deployment</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_comments.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_comments.html b/docs/build/html/topics/impala_comments.html new file mode 100644 index 0000000..e3d711a --- /dev/null +++ b/docs/build/html/topics/impala_comments.html @@ -0,0 +1,46 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_langref.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="comments"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Comments</title></head><body id="comments"><main 
role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">Comments</h1> + + + <div class="body conbody"> + + <p class="p"> + + Impala supports the familiar styles of SQL comments: + </p> + + <ul class="ul"> + <li class="li"> + All text from a <code class="ph codeph">--</code> sequence to the end of the line is considered a comment and ignored. + This type of comment can occur on a single line by itself, or after all or part of a statement. + </li> + + <li class="li"> + All text from a <code class="ph codeph">/*</code> sequence to the next <code class="ph codeph">*/</code> sequence is considered a + comment and ignored. This type of comment can stretch over multiple lines. This type of comment can occur + on one or more lines by itself, in the middle of a statement, or before or after a statement. + </li> + </ul> + + <p class="p"> + For example: + </p> + +<pre class="pre codeblock"><code>-- This line is a comment about a table. +create table ...; + +/* +This is a multi-line comment about a query. +*/ +select ...; + +select * from t /* This is an embedded comment about a query. */ where ...; + +select * from t -- This is a trailing comment within a multi-line command. +where ...; +</code></pre> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_langref.html">Impala SQL Language Reference</a></div></div></nav></article></main></body></html> \ No newline at end of file
