[18/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS] Publish rendered Impala documentation to ASF site

jbapple Wed, 12 Apr 2017 11:25:27 -0700

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_optimize_partition_key_scans.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_optimize_partition_key_scans.html 
b/docs/build/html/topics/impala_optimize_partition_key_scans.html
new file mode 100644
index 0000000..07bfbb1
--- /dev/null
+++ b/docs/build/html/topics/impala_optimize_partition_key_scans.html
@@ -0,0 +1,188 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_query_options.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="DC.Format" content="XHTML"><meta name="DC.Identifier" 
content="optimize_partition_key_scans"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>OPTIMIZE_PARTITION_KEY_SCANS Query Option 
(Impala 2.5 or higher only)</title></head><body 
id="optimize_partition_key_scans"><main role="main"><article role="article" 
aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" 
id="ariaid-title1">OPTIMIZE_PARTITION_KEY_SCANS Query Option (<span 
class="keyword">Impala 2.5</span> or higher only)</h1>
+  
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      
+      Enables a fast code path for queries that apply simple aggregate 
functions to partition key
+      columns: <code class="ph codeph">MIN(<var class="keyword 
varname">key_column</var>)</code>, <code class="ph codeph">MAX(<var 
class="keyword varname">key_column</var>)</code>,
+      or <code class="ph codeph">COUNT(DISTINCT <var class="keyword 
varname">key_column</var>)</code>.
+    </p>
+
+    <p class="p">
+        <strong class="ph b">Type:</strong> Boolean; recognized values are 1 
and 0, or <code class="ph codeph">true</code> and <code class="ph 
codeph">false</code>;
+        any other value interpreted as <code class="ph codeph">false</code>
+      </p>
+    <p class="p">
+        <strong class="ph b">Default:</strong> <code class="ph 
codeph">false</code> (shown as 0 in output of <code class="ph 
codeph">SET</code> statement)
+      </p>
+
+    <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+        In <span class="keyword">Impala 2.5.0</span>, only the value 1 enables 
the option, and the value
+        <code class="ph codeph">true</code> is not recognized. This limitation 
is
+        tracked by the issue
+        <a class="xref" 
href="https://issues.apache.org/jira/browse/IMPALA-3334"; 
target="_blank">IMPALA-3334</a>,
+        which shows the releases where the problem is fixed.
+      </div>
+
+    <p class="p">
+        <strong class="ph b">Added in:</strong> <span class="keyword">Impala 
2.5.0</span>
+      </p>
+
+    <p class="p">
+        <strong class="ph b">Usage notes:</strong>
+      </p>
+
+    <p class="p">
+      This optimization speeds up common <span 
class="q">"introspection"</span> operations when using queries
+      to calculate the cardinality and range for partition key columns.
+    </p>
+
+    <p class="p">
+      This optimization does not apply if the queries contain any <code 
class="ph codeph">WHERE</code>,
+      <code class="ph codeph">GROUP BY</code>, or <code class="ph 
codeph">HAVING</code> clause. The relevant queries
+      should only compute the minimum, maximum, or number of distinct values 
for the
+      partition key columns across the whole table.
+    </p>
+
+    <p class="p">
+      This optimization is enabled by a query option because it skips some 
consistency checks
+      and therefore can return slightly different partition values if 
partitions are in the
+      process of being added, dropped, or loaded outside of Impala. Queries 
might exhibit different
+      behavior depending on the setting of this option in the following cases:
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        <p class="p">
+          If files are removed from a partition using HDFS or other non-Impala 
operations,
+          there is a period until the next <code class="ph 
codeph">REFRESH</code> of the table where regular
+          queries fail at run time because they detect the missing files. With 
this optimization
+          enabled, queries that evaluate only the partition key column values 
(not the contents of
+          the partition itself) succeed, and treat the partition as if it 
still exists.
+        </p>
+      </li>
+      <li class="li">
+        <p class="p">
+          If a partition contains any data files, but the data files do not 
contain any rows,
+          a regular query considers that the partition does not exist. With 
this optimization
+          enabled, the partition is treated as if it exists.
+        </p>
+        <p class="p">
+          If the partition includes no files at all, this optimization does 
not change the query
+          behavior: the partition is considered to not exist whether or not 
this optimization is enabled.
+        </p>
+      </li>
+    </ul>
+
+    <p class="p">
+        <strong class="ph b">Examples:</strong>
+      </p>
+
+    <p class="p">
+      The following example shows initial schema setup and the default 
behavior of queries that
+      return just the partition key column for a table:
+    </p>
+
+<pre class="pre codeblock"><code>
+-- Make a partitioned table with 3 partitions.
+create table t1 (s string) partitioned by (year int);
+insert into t1 partition (year=2015) values ('last year');
+insert into t1 partition (year=2016) values ('this year');
+insert into t1 partition (year=2017) values ('next year');
+
+-- Regardless of the option setting, this query must read the
+-- data files to know how many rows to return for each year value.
+explain select year from t1;
++-----------------------------------------------------+
+| Explain String                                      |
++-----------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=0B VCores=0 |
+|                                                     |
+| F00:PLAN FRAGMENT [UNPARTITIONED]                   |
+|   00:SCAN HDFS [key_cols.t1]                        |
+|      partitions=3/3 files=4 size=40B                |
+|      table stats: 3 rows total                      |
+|      column stats: all                              |
+|      hosts=3 per-host-mem=unavailable               |
+|      tuple-ids=0 row-size=4B cardinality=3          |
++-----------------------------------------------------+
+
+-- The aggregation operation means the query does not need to read
+-- the data within each partition: the result set contains exactly 1 row
+-- per partition, derived from the partition key column value.
+-- By default, Impala still includes a 'scan' operation in the query.
+explain select distinct year from t1;
++------------------------------------------------------------------------------------+
+| Explain String                                                               
      |
++------------------------------------------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=0B VCores=0                          
      |
+|                                                                              
      |
+| 01:AGGREGATE [FINALIZE]                                                      
      |
+| |  group by: year                                                            
      |
+| |                                                                            
      |
+| 00:SCAN HDFS [key_cols.t1]                                                   
      |
+|    partitions=0/0 files=0 size=0B                                            
      |
++------------------------------------------------------------------------------------+
+</code></pre>
+
+    <p class="p">
+      The following examples show how the plan is made more efficient when the
+      <code class="ph codeph">OPTIMIZE_PARTITION_KEY_SCANS</code> option is 
enabled:
+    </p>
+
+<pre class="pre codeblock"><code>
+set optimize_partition_key_scans=1;
+OPTIMIZE_PARTITION_KEY_SCANS set to 1
+
+-- The aggregation operation is turned into a UNION internally,
+-- with constant values known in advance based on the metadata
+-- for the partitioned table.
+explain select distinct year from t1;
++-----------------------------------------------------+
+| Explain String                                      |
++-----------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=0B VCores=0 |
+|                                                     |
+| F00:PLAN FRAGMENT [UNPARTITIONED]                   |
+|   01:AGGREGATE [FINALIZE]                           |
+|   |  group by: year                                 |
+|   |  hosts=1 per-host-mem=unavailable               |
+|   |  tuple-ids=1 row-size=4B cardinality=3          |
+|   |                                                 |
+|   00:UNION                                          |
+|      constant-operands=3                            |
+|      hosts=1 per-host-mem=unavailable               |
+|      tuple-ids=0 row-size=4B cardinality=3          |
++-----------------------------------------------------+
+
+-- The same optimization applies to other aggregation queries
+-- that only return values based on partition key columns:
+-- MIN, MAX, COUNT(DISTINCT), and so on.
+explain select min(year) from t1;
++-----------------------------------------------------+
+| Explain String                                      |
++-----------------------------------------------------+
+| Estimated Per-Host Requirements: Memory=0B VCores=0 |
+|                                                     |
+| F00:PLAN FRAGMENT [UNPARTITIONED]                   |
+|   01:AGGREGATE [FINALIZE]                           |
+|   |  output: min(year)                              |
+|   |  hosts=1 per-host-mem=unavailable               |
+|   |  tuple-ids=1 row-size=4B cardinality=1          |
+|   |                                                 |
+|   00:UNION                                          |
+|      constant-operands=3                            |
+|      hosts=1 per-host-mem=unavailable               |
+|      tuple-ids=0 row-size=4B cardinality=3          |
++-----------------------------------------------------+
+</code></pre>
+
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div 
class="parentlink"><strong>Parent topic:</strong> <a class="link" 
href="../topics/impala_query_options.html">Query Options for the SET 
Statement</a></div></div></nav></article></main></body></html>
\ No newline at end of file


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_order_by.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_order_by.html 
b/docs/build/html/topics/impala_order_by.html
new file mode 100644
index 0000000..c3f5105
--- /dev/null
+++ b/docs/build/html/topics/impala_order_by.html
@@ -0,0 +1,407 @@
+<!DOCTYPE html
+  SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; 
charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) 
Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta 
name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" 
content="../topics/impala_select.html"><meta name="prodname" 
content="Impala"><meta name="prodname" content="Impala"><meta name="version" 
content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta 
name="DC.Format" content="XHTML"><meta name="DC.Identifier" 
content="order_by"><link rel="stylesheet" type="text/css" 
href="../commonltr.css"><title>ORDER BY Clause</title></head><body 
id="order_by"><main role="main"><article role="article" 
aria-labelledby="ariaid-title1">
+
+  <h1 class="title topictitle1" id="ariaid-title1">ORDER BY Clause</h1>
+  
+
+  <div class="body conbody">
+
+    <p class="p">
+      The familiar <code class="ph codeph">ORDER BY</code> clause of a <code 
class="ph codeph">SELECT</code> statement sorts the result set
+      based on the values from one or more columns.
+    </p>
+
+    <p class="p">
+      For distributed queries, this is a relatively expensive operation, 
because the entire result set must be
+      produced and transferred to one node before the sorting can happen. This 
can require more memory capacity
+      than a query without <code class="ph codeph">ORDER BY</code>. Even if 
the query takes approximately the same time to finish
+      with or without the <code class="ph codeph">ORDER BY</code> clause, 
subjectively it can appear slower because no results
+      are available until all processing is finished, rather than results 
coming back gradually as rows matching
+      the <code class="ph codeph">WHERE</code> clause are found. Therefore, if 
you only need the first N results from the sorted
+      result set, also include the <code class="ph codeph">LIMIT</code> 
clause, which reduces network overhead and the memory
+      requirement on the coordinator node.
+    </p>
+
+    <div class="note note note_note"><span class="note__title 
notetitle">Note:</span> 
+      <p class="p">
+        In Impala 1.4.0 and higher, the <code class="ph codeph">LIMIT</code> 
clause is now optional (rather than required) for
+        queries that use the <code class="ph codeph">ORDER BY</code> clause. 
Impala automatically uses a temporary disk work area
+        to perform the sort if the sort operation would otherwise exceed the 
Impala memory limit for a particular
+        DataNode.
+      </p>
+    </div>
+
+    <p class="p">
+        <strong class="ph b">Syntax:</strong>
+      </p>
+
+    <p class="p">
+      The full syntax for the <code class="ph codeph">ORDER BY</code> clause 
is:
+    </p>
+
+<pre class="pre codeblock"><code>ORDER BY <var class="keyword 
varname">col_ref</var> [, <var class="keyword varname">col_ref</var> ...] [ASC 
| DESC] [NULLS FIRST | NULLS LAST]
+
+col_ref ::= <var class="keyword varname">column_name</var> | <var 
class="keyword varname">integer_literal</var>
+</code></pre>
+
+    <p class="p">
+      Although the most common usage is <code class="ph codeph">ORDER BY <var 
class="keyword varname">column_name</var></code>, you can also
+      specify <code class="ph codeph">ORDER BY 1</code> to sort by the first 
column of the result set, <code class="ph codeph">ORDER BY
+      2</code> to sort by the second column, and so on. The number must be a 
numeric literal, not some other kind
+      of constant expression. (If the argument is some other expression, even 
a <code class="ph codeph">STRING</code> value, the
+      query succeeds but the order of results is undefined.)
+    </p>
+
+    <p class="p">
+      <code class="ph codeph">ORDER BY <var class="keyword 
varname">column_number</var></code> can only be used when the query explicitly 
lists
+      the columns in the <code class="ph codeph">SELECT</code> list, not with 
<code class="ph codeph">SELECT *</code> queries.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Ascending and descending sorts:</strong>
+    </p>
+
+    <p class="p">
+      The default sort order (the same as using the <code class="ph 
codeph">ASC</code> keyword) puts the smallest values at the
+      start of the result set, and the largest values at the end. Specifying 
the <code class="ph codeph">DESC</code> keyword
+      reverses that order.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Sort order for NULL values:</strong>
+    </p>
+
+    <p class="p">
+      See <a class="xref" href="impala_literals.html#null">NULL</a> for 
details about how <code class="ph codeph">NULL</code> values are positioned
+      in the sorted result set, and how to use the <code class="ph 
codeph">NULLS FIRST</code> and <code class="ph codeph">NULLS LAST</code>
+      clauses. (The sort position for <code class="ph codeph">NULL</code> 
values in <code class="ph codeph">ORDER BY ... DESC</code> queries is
+      changed in Impala 1.2.1 and higher to be more standards-compliant, and 
the <code class="ph codeph">NULLS FIRST</code> and
+      <code class="ph codeph">NULLS LAST</code> keywords are new in Impala 
1.2.1.)
+    </p>
+
+    <p class="p">
+        Prior to Impala 1.4.0, Impala required any query including an
+        <code class="ph codeph"><a class="xref" 
href="../shared/../topics/impala_order_by.html#order_by">ORDER BY</a></code> 
clause to also use a
+        <code class="ph codeph"><a class="xref" 
href="../shared/../topics/impala_limit.html#limit">LIMIT</a></code> clause. In 
Impala 1.4.0 and
+        higher, the <code class="ph codeph">LIMIT</code> clause is optional 
for <code class="ph codeph">ORDER BY</code> queries. In cases where
+        sorting a huge result set requires enough memory to exceed the Impala 
memory limit for a particular node,
+        Impala automatically uses a temporary disk work area to perform the 
sort operation.
+      </p>
+
+    
+    <p class="p">
+        <strong class="ph b">Complex type considerations:</strong>
+      </p>
+
+    <p class="p">
+      In <span class="keyword">Impala 2.3</span> and higher, the complex data 
types <code class="ph codeph">STRUCT</code>,
+      <code class="ph codeph">ARRAY</code>, and <code class="ph 
codeph">MAP</code> are available. These columns cannot
+      be referenced directly in the <code class="ph codeph">ORDER BY</code> 
clause.
+      When you query a complex type column, you use join notation to <span 
class="q">"unpack"</span> the elements
+      of the complex type, and within the join query you can include an <code 
class="ph codeph">ORDER BY</code>
+      clause to control the order in the result set of the scalar elements 
from the complex type.
+      See <a class="xref" 
href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or 
higher only)</a> for details about Impala support for complex types.
+    </p>
+
+    <p class="p">
+      The following query shows how a complex type column cannot be directly 
used in an <code class="ph codeph">ORDER BY</code> clause:
+    </p>
+
+<pre class="pre codeblock"><code>CREATE TABLE games (id BIGINT, score ARRAY 
&lt;BIGINT&gt;) STORED AS PARQUET;
+...use LOAD DATA to load externally created Parquet files into the table...
+SELECT id FROM games ORDER BY score DESC;
+ERROR: AnalysisException: ORDER BY expression 'score' with complex type 
'ARRAY&lt;BIGINT&gt;' is not supported.
+</code></pre>
+
+    <p class="p">
+        <strong class="ph b">Examples:</strong>
+      </p>
+
+    <p class="p">
+      The following query retrieves the user ID and score, only for scores 
greater than one million,
+      with the highest scores for each user listed first.
+      Because the individual array elements are now represented as separate 
rows in the result set,
+      they can be used in the <code class="ph codeph">ORDER BY</code> clause, 
referenced using the <code class="ph codeph">ITEM</code>
+      pseudocolumn that represents each array element.
+    </p>
+
+<pre class="pre codeblock"><code>SELECT id, item FROM games, games.score
+  WHERE item &gt; 1000000
+ORDER BY id, item desc;
+</code></pre>
+
+    <p class="p">
+      The following queries use similar <code class="ph codeph">ORDER 
BY</code> techniques with variations of the <code class="ph codeph">GAMES</code>
+      table, where the complex type is an <code class="ph codeph">ARRAY</code> 
containing <code class="ph codeph">STRUCT</code> or <code class="ph 
codeph">MAP</code>
+      elements to represent additional details about each game that was played.
+      For an array of structures, the fields of the structure are referenced 
as <code class="ph codeph">ITEM.<var class="keyword 
varname">field_name</var></code>.
+      For an array of maps, the keys and values within each array element are 
referenced as <code class="ph codeph">ITEM.KEY</code>
+      and <code class="ph codeph">ITEM.VALUE</code>.
+    </p>
+
+<pre class="pre codeblock"><code>CREATE TABLE games2 (id BIGINT, play array 
&lt; struct &lt;game_name: string, score: BIGINT, high_score: boolean&gt; &gt;) 
STORED AS PARQUET
+...use LOAD DATA to load externally created Parquet files into the table...
+SELECT id, item.game_name, item.score FROM games2, games2.play
+  WHERE item.score &gt; 1000000
+ORDER BY id, item.score DESC;
+
+CREATE TABLE games3 (id BIGINT, play ARRAY &lt; MAP &lt;STRING, BIGINT&gt; 
&gt;) STORED AS PARQUET;
+...use LOAD DATA to load externally created Parquet files into the table...
+SELECT id, info.key AS k, info.value AS v from games3, games3.play AS plays, 
games3.play.item AS info
+  WHERE info.KEY = 'score' AND info.VALUE &gt; 1000000
+ORDER BY id, info.value desc;
+</code></pre>
+
+    <p class="p">
+        <strong class="ph b">Usage notes:</strong>
+      </p>
+
+    <p class="p">
+      Although the <code class="ph codeph">LIMIT</code> clause is now optional 
on <code class="ph codeph">ORDER BY</code> queries, if your
+      query only needs some number of rows that you can predict in advance, 
use the <code class="ph codeph">LIMIT</code> clause
+      to reduce unnecessary processing. For example, if the query has a clause 
<code class="ph codeph">LIMIT 10</code>, each data
+      node sorts its portion of the relevant result set and only returns 10 
rows to the coordinator node. The
+      coordinator node picks the 10 highest or lowest row values out of this 
small intermediate result set.
+    </p>
+
+    <p class="p">
+      If an <code class="ph codeph">ORDER BY</code> clause is applied to an 
early phase of query processing, such as a subquery
+      or a view definition, Impala ignores the <code class="ph codeph">ORDER 
BY</code> clause. To get ordered results from a
+      subquery or view, apply an <code class="ph codeph">ORDER BY</code> 
clause to the outermost or final <code class="ph codeph">SELECT</code>
+      level.
+    </p>
+
+    <p class="p">
+      <code class="ph codeph">ORDER BY</code> is often used in combination 
with <code class="ph codeph">LIMIT</code> to perform <span 
class="q">"top-N"</span>
+      queries:
+    </p>
+
+<pre class="pre codeblock"><code>SELECT user_id AS "Top 10 Visitors", 
SUM(page_views) FROM web_stats
+  GROUP BY page_views, user_id
+  ORDER BY SUM(page_views) DESC LIMIT 10;
+</code></pre>
+
+    <p class="p">
+      <code class="ph codeph">ORDER BY</code> is sometimes used in combination 
with <code class="ph codeph">OFFSET</code> and
+      <code class="ph codeph">LIMIT</code> to paginate query results, although 
it is relatively inefficient to issue multiple
+      queries like this against the large tables typically used with Impala:
+    </p>
+
+<pre class="pre codeblock"><code>SELECT page_title AS "Page 1 of search 
results", page_url FROM search_content
+  WHERE LOWER(page_title) LIKE '%game%')
+  ORDER BY page_title LIMIT 10 OFFSET 0;
+SELECT page_title AS "Page 2 of search results", page_url FROM search_content
+  WHERE LOWER(page_title) LIKE '%game%')
+  ORDER BY page_title LIMIT 10 OFFSET 10;
+SELECT page_title AS "Page 3 of search results", page_url FROM search_content
+  WHERE LOWER(page_title) LIKE '%game%')
+  ORDER BY page_title LIMIT 10 OFFSET 20;
+</code></pre>
+
+    <p class="p">
+        <strong class="ph b">Internal details:</strong>
+      </p>
+
+    <p class="p">
+      Impala sorts the intermediate results of an <code class="ph 
codeph">ORDER BY</code> clause in memory whenever practical. In
+      a cluster of N DataNodes, each node sorts roughly 1/Nth of the result 
set, the exact proportion varying
+      depending on how the data matching the query is distributed in HDFS.
+    </p>
+
+    <p class="p">
+      If the size of the sorted intermediate result set on any DataNode would 
cause the query to exceed the Impala
+      memory limit, Impala sorts as much as practical in memory, then writes 
partially sorted data to disk. (This
+      technique is known in industry terminology as <span class="q">"external 
sorting"</span> and <span class="q">"spilling to disk"</span>.) As each
+      8 MB batch of data is written to disk, Impala frees the corresponding 
memory to sort a new 8 MB batch of
+      data. When all the data has been processed, a final merge sort operation 
is performed to correctly order the
+      in-memory and on-disk results as the result set is transmitted back to 
the coordinator node. When external
+      sorting becomes necessary, Impala requires approximately 60 MB of RAM at 
a minimum for the buffers needed to
+      read, write, and sort the intermediate results. If more RAM is available 
on the DataNode, Impala will use
+      the additional RAM to minimize the amount of disk I/O for sorting.
+    </p>
+
+    <p class="p">
+      This external sort technique is used as appropriate on each DataNode 
(possibly including the coordinator
+      node) to sort the portion of the result set that is processed on that 
node. When the sorted intermediate
+      results are sent back to the coordinator node to produce the final 
result set, the coordinator node uses a
+      merge sort technique to produce a final sorted result set without using 
any extra resources on the
+      coordinator node.
+    </p>
+
+    <p class="p">
+      <strong class="ph b">Configuration for disk usage:</strong>
+    </p>
+
+    <p class="p">
+        By default, intermediate files used during large sort, join, 
aggregation, or analytic function operations
+        are stored in the directory <span class="ph 
filepath">/tmp/impala-scratch</span> . These files are removed when the
+        operation finishes. (Multiple concurrent queries can perform 
operations that use the <span class="q">"spill to disk"</span>
+        technique, without any name conflicts for these temporary files.) You 
can specify a different location by
+        starting the <span class="keyword cmdname">impalad</span> daemon with 
the
+        <code class="ph codeph">--scratch_dirs="<var class="keyword 
varname">path_to_directory</var>"</code> configuration option.
+        You can specify a single directory, or a comma-separated list of 
directories. The scratch directories must
+        be on the local filesystem, not in HDFS. You might specify different 
directory paths for different hosts,
+        depending on the capacity and speed
+        of the available storage devices. In <span class="keyword">Impala 
2.3</span> or higher, Impala successfully starts (with a warning
+        Impala successfully starts (with a warning written to the log) if it 
cannot create or read and write files
+        in one of the scratch directories. If there is less than 1 GB free on 
the filesystem where that directory resides,
+        Impala still runs, but writes a warning message to its log.  If Impala 
encounters an error reading or writing
+        files in a scratch directory during a query, Impala logs the error and 
the query fails.
+      </p>
+
+
+
+
+
+    <p class="p">
+        <strong class="ph b">Sorting considerations:</strong> Although you can 
specify an <code class="ph codeph">ORDER BY</code> clause in an
+        <code class="ph codeph">INSERT ... SELECT</code> statement, any <code 
class="ph codeph">ORDER BY</code> clause is ignored and the
+        results are not necessarily sorted. An <code class="ph codeph">INSERT 
... SELECT</code> operation potentially creates
+        many different data files, prepared on different data nodes, and 
therefore the notion of the data being
+        stored in sorted order is impractical.
+      </p>
+
+    <div class="p">
+        An <code class="ph codeph">ORDER BY</code> clause without an 
additional <code class="ph codeph">LIMIT</code> clause is ignored in any
+        view definition. If you need to sort the entire result set from a 
view, use an <code class="ph codeph">ORDER BY</code>
+        clause in the <code class="ph codeph">SELECT</code> statement that 
queries the view. You can still make a simple <span class="q">"top
+        10"</span> report by combining the <code class="ph codeph">ORDER 
BY</code> and <code class="ph codeph">LIMIT</code> clauses in the same
+        view definition:
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create table unsorted 
(x bigint);
+[localhost:21000] &gt; insert into unsorted values (1), (9), (3), (7), (5), 
(8), (4), (6), (2);
+[localhost:21000] &gt; create view sorted_view as select x from unsorted order 
by x;
+[localhost:21000] &gt; select x from sorted_view; -- ORDER BY clause in view 
has no effect.
++---+
+| x |
++---+
+| 1 |
+| 9 |
+| 3 |
+| 7 |
+| 5 |
+| 8 |
+| 4 |
+| 6 |
+| 2 |
++---+
+[localhost:21000] &gt; select x from sorted_view order by x; -- View query 
requires ORDER BY at outermost level.
++---+
+| x |
++---+
+| 1 |
+| 2 |
+| 3 |
+| 4 |
+| 5 |
+| 6 |
+| 7 |
+| 8 |
+| 9 |
++---+
+[localhost:21000] &gt; create view top_3_view as select x from unsorted order 
by x limit 3;
+[localhost:21000] &gt; select x from top_3_view; -- ORDER BY and LIMIT 
together in view definition are preserved.
++---+
+| x |
++---+
+| 1 |
+| 2 |
+| 3 |
++---+
+</code></pre>
+      </div>
+
+    <p class="p">
+      With the lifting of the requirement to include a <code class="ph 
codeph">LIMIT</code> clause in every <code class="ph codeph">ORDER
+      BY</code> query (in Impala 1.4 and higher):
+    </p>
+
+    <ul class="ul">
+      <li class="li">
+        <p class="p">
+          Now the use of scratch disk space raises the possibility of an <span 
class="q">"out of disk space"</span> error on a
+          particular DataNode, as opposed to the previous possibility of an 
<span class="q">"out of memory"</span> error. Make sure
+          to keep at least 1 GB free on the filesystem used for temporary 
sorting work.
+        </p>
+      </li>
+
+      <li class="li">
+        <p class="p">
+          The query options
+          <a class="xref" 
href="impala_default_order_by_limit.html#default_order_by_limit">DEFAULT_ORDER_BY_LIMIT</a>
 and
+          <a class="xref" 
href="impala_abort_on_default_limit_exceeded.html#abort_on_default_limit_exceeded">ABORT_ON_DEFAULT_LIMIT_EXCEEDED</a>,
+          which formerly controlled the behavior of <code class="ph 
codeph">ORDER BY</code> queries with no limit specified, are
+          now ignored.
+        </p>
+      </li>
+    </ul>
+
+    <p class="p">
+        In Impala 1.2.1 and higher, all <code class="ph codeph">NULL</code> 
values come at the end of the result set for
+        <code class="ph codeph">ORDER BY ... ASC</code> queries, and at the 
beginning of the result set for <code class="ph codeph">ORDER BY ...
+        DESC</code> queries. In effect, <code class="ph codeph">NULL</code> is 
considered greater than all other values for
+        sorting purposes. The original Impala behavior always put <code 
class="ph codeph">NULL</code> values at the end, even for
+        <code class="ph codeph">ORDER BY ... DESC</code> queries. The new 
behavior in Impala 1.2.1 makes Impala more compatible
+        with other popular database systems. In Impala 1.2.1 and higher, you 
can override or specify the sorting
+        behavior for <code class="ph codeph">NULL</code> by adding the clause 
<code class="ph codeph">NULLS FIRST</code> or <code class="ph codeph">NULLS
+        LAST</code> at the end of the <code class="ph codeph">ORDER BY</code> 
clause.
+      </p>
+<pre class="pre codeblock"><code>[localhost:21000] &gt; create table numbers 
(x int);
+[localhost:21000] &gt; insert into numbers values (1), (null), (2), (null), 
(3);
+[localhost:21000] &gt; select x from numbers order by x nulls first;
++------+
+| x    |
++------+
+| NULL |
+| NULL |
+| 1    |
+| 2    |
+| 3    |
++------+
+[localhost:21000] &gt; select x from numbers order by x desc nulls first;
++------+
+| x    |
++------+
+| NULL |
+| NULL |
+| 3    |
+| 2    |
+| 1    |
++------+
+[localhost:21000] &gt; select x from numbers order by x nulls last;
++------+
+| x    |
++------+
+| 1    |
+| 2    |
+| 3    |
+| NULL |
+| NULL |
++------+
+[localhost:21000] &gt; select x from numbers order by x desc nulls last;
++------+
+| x    |
++------+
+| 3    |
+| 2    |
+| 1    |
+| NULL |
+| NULL |
++------+
+</code></pre>
+
+    <p class="p">
+        <strong class="ph b">Related information:</strong>
+      </p>
+
+    <p class="p">
+      See <a class="xref" href="impala_select.html#select">SELECT 
Statement</a> for further examples of queries with the <code class="ph 
codeph">ORDER
+      BY</code> clause.
+    </p>
+
+    <p class="p">
+      Analytic functions use the <code class="ph codeph">ORDER BY</code> 
clause in a different context to define the sequence in
+      which rows are analyzed. See <a class="xref" 
href="impala_analytic_functions.html#analytic_functions">Impala Analytic 
Functions</a> for details.
+    </p>
+  </div>
+<nav role="navigation" class="related-links"><div class="familylinks"><div 
class="parentlink"><strong>Parent topic:</strong> <a class="link" 
href="../topics/impala_select.html">SELECT 
Statement</a></div></div></nav></article></main></body></html>
\ No newline at end of file

[18/51] [partial] incubator-impala git commit: IMPALA-4181 [DOCS] Publish rendered Impala documentation to ASF site

Reply via email to