http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_txtfile.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_txtfile.xml b/docs/topics/impala_txtfile.xml index ec8c059..543e2ff 100644 --- a/docs/topics/impala_txtfile.xml +++ b/docs/topics/impala_txtfile.xml @@ -4,7 +4,15 @@ <title>Using Text Data Files with Impala Tables</title> <titlealts audience="PDF"><navtitle>Text Data Files</navtitle></titlealts> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="File Formats"/> + <data name="Category" value="Tables"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> <conbody> @@ -15,10 +23,790 @@ text files, such as CSV or TSV with commas or tabs for delimiters. </p> - + <p> + Text files are also very flexible in their column definitions. For example, a text file could have more + fields than the Impala table, and those extra fields are ignored during queries; or it could have fewer + fields than the Impala table, and those missing fields are treated as <codeph>NULL</codeph> values in + queries. You could have fields that were treated as numbers or timestamps in a table, then use <codeph>ALTER + TABLE ... REPLACE COLUMNS</codeph> to switch them to strings, or the reverse. + </p> + + <table> + <title>Text Format Support in Impala</title> + <tgroup cols="5"> + <colspec colname="1" colwidth="10*"/> + <colspec colname="2" colwidth="10*"/> + <colspec colname="3" colwidth="20*"/> + <colspec colname="4" colwidth="30*"/> + <colspec colname="5" colwidth="30*"/> + <thead> + <row> + <entry> + File Type + </entry> + <entry> + Format + </entry> + <entry> + Compression Codecs + </entry> + <entry> + Impala Can CREATE? + </entry> + <entry> + Impala Can INSERT? + </entry> + </row> + </thead> + <tbody> + <row conref="impala_file_formats.xml#file_formats/txtfile_support"> + <entry/> + </row> + </tbody> + </tgroup> + </table> + + <p outputclass="toc inpage"/> + + </conbody> + + <concept id="text_performance"> + + <title>Query Performance for Impala Text Tables</title> + <prolog> + <metadata> + <data name="Category" value="Performance"/> + </metadata> + </prolog> + + <conbody> + + <p> + Data stored in text format is relatively bulky, and not as efficient to query as binary formats such as + Parquet. You typically use text tables with Impala if that is the format you receive the data and you do + not have control over that process, or if you are a relatively new Hadoop user and not familiar with + techniques to generate files in other formats. (Because the default format for <codeph>CREATE + TABLE</codeph> is text, you might create your first Impala tables as text without giving performance much + thought.) Either way, look for opportunities to use more efficient file formats for the tables used in your + most performance-critical queries. + </p> + + <p> + For frequently queried data, you might load the original text data files into one Impala table, then use an + <codeph>INSERT</codeph> statement to transfer the data to another table that uses the Parquet file format; + the data is converted automatically as it is stored in the destination table. + </p> + + <p> + For more compact data, consider using LZO compression for the text files. 
LZO is the only compression codec + that Impala supports for text data, because the <q>splittable</q> nature of LZO data files lets different + nodes work on different parts of the same file in parallel. See <xref href="impala_txtfile.xml#lzo"/> for + details. + </p> + + <p rev="2.0.0"> + In Impala 2.0 and later, you can also use text data compressed in the gzip, bzip2, or Snappy formats. + Because these compressed formats are not <q>splittable</q> in the way that LZO is, there is less + opportunity for Impala to parallelize queries on them. Therefore, use these types of compressed data only + for convenience if that is the format in which you receive the data. Prefer to use LZO compression for text + data if you have the choice, or convert the data to Parquet using an <codeph>INSERT ... SELECT</codeph> + statement to copy the original data into a Parquet table. + </p> + + <note rev="2.2.0"> + <p> + Impala supports bzip files created by the <codeph>bzip2</codeph> command, but not bzip files with + multiple streams created by the <codeph>pbzip2</codeph> command. Impala decodes only the data from the + first part of such files, leading to incomplete results. + </p> + + <p> + The maximum size that Impala can accommodate for an individual bzip file is 1 GB (after uncompression). + </p> + </note> + + <p conref="../shared/impala_common.xml#common/s3_block_splitting"/> + + </conbody> + + </concept> + + <concept id="text_ddl"> + + <title>Creating Text Tables</title> + + <conbody> + + <p> + <b>To create a table using text data files:</b> + </p> + + <p> + If the exact format of the text data files (such as the delimiter character) is not significant, use the + <codeph>CREATE TABLE</codeph> statement with no extra clauses at the end to create a text-format table. For + example: + </p> + +<codeblock>create table my_table(id int, s string, n int, t timestamp, b boolean); +</codeblock> + + <p> + The data files created by any <codeph>INSERT</codeph> statements will use the Ctrl-A character (hex 01) as + a separator between each column value. + </p> + + <p> + A common use case is to import existing text files into an Impala table. The syntax is more verbose; the + significant part is the <codeph>FIELDS TERMINATED BY</codeph> clause, which must be preceded by the + <codeph>ROW FORMAT DELIMITED</codeph> clause. The statement can end with a <codeph>STORED AS + TEXTFILE</codeph> clause, but that clause is optional because text format tables are the default. For + example: + </p> + +<codeblock>create table csv(id int, s string, n int, t timestamp, b boolean) + row format delimited + <ph id="csv">fields terminated by ',';</ph> + +create table tsv(id int, s string, n int, t timestamp, b boolean) + row format delimited + <ph id="tsv">fields terminated by '\t';</ph> + +create table pipe_separated(id int, s string, n int, t timestamp, b boolean) + row format delimited + <ph id="psv">fields terminated by '|'</ph> + stored as textfile; +</codeblock> + + <p> + You can create tables with specific separator characters to import text files in familiar formats such as + CSV, TSV, or pipe-separated. You can also use these tables to produce output data files, by copying data + into them through the <codeph>INSERT ... SELECT</codeph> syntax and then extracting the data files from the + Impala data directory. 
+ </p> + + <p rev="1.3.1"> + In Impala 1.3.1 and higher, you can specify a delimiter character <codeph>'\</codeph><codeph>0'</codeph> to + use the ASCII 0 (<codeph>nul</codeph>) character for text tables: + </p> + +<codeblock rev="1.3.1">create table nul_separated(id int, s string, n int, t timestamp, b boolean) + row format delimited + fields terminated by '\0' + stored as textfile; +</codeblock> + + <note> + <p> + Do not surround string values with quotation marks in text data files that you construct. If you need to + include the separator character inside a field value, for example to put a string value with a comma + inside a CSV-format data file, specify an escape character on the <codeph>CREATE TABLE</codeph> statement + with the <codeph>ESCAPED BY</codeph> clause, and insert that character immediately before any separator + characters that need escaping. + </p> + </note> + +<!-- + <p> + In the <cmdname>impala-shell</cmdname> interpreter, issue a command similar to: + </p> + +<codeblock>create table textfile_table (<varname>column_specs</varname>) stored as textfile; +/* If the STORED AS clause is omitted, the default is a TEXTFILE with hex 01 characters as the delimiter. */ +create table default_table (<varname>column_specs</varname>); +/* Some optional clauses in the CREATE TABLE statement apply only to Text tables. */ +create table csv_table (<varname>column_specs</varname>) row format delimited fields terminated by ','; +create table tsv_table (<varname>column_specs</varname>) row format delimited fields terminated by '\t'; +create table dos_table (<varname>column_specs</varname>) lines terminated by '\r';</codeblock> +--> + + <p> + Issue a <codeph>DESCRIBE FORMATTED <varname>table_name</varname></codeph> statement to see the details of + how each table is represented internally in Impala. + </p> + + <p conref="../shared/impala_common.xml#common/complex_types_unsupported_filetype"/> + + </conbody> + + </concept> + + <concept id="text_data_files"> + + <title>Data Files for Text Tables</title> + + <conbody> + + <p> + When Impala queries a table with data in text format, it consults all the data files in the data directory + for that table, with some exceptions: + </p> + + <ul rev="2.2.0"> + <li> + <p> + Impala ignores any hidden files, that is, files whose names start with a dot or an underscore. + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/ignore_file_extensions"/> + </li> + + <li> +<!-- Copied and slightly adapted text from later on in this same file. Turn into a conref. --> + <p> + Impala uses suffixes to recognize when text data files are compressed text. For Impala to recognize the + compressed text files, they must have the appropriate file extension corresponding to the compression + codec, either <codeph>.gz</codeph>, <codeph>.bz2</codeph>, or <codeph>.snappy</codeph>. The extensions + can be in uppercase or lowercase. + </p> + </li> + + <li> + Otherwise, the file names are not significant. When you put files into an HDFS directory through ETL + jobs, or point Impala to an existing HDFS directory with the <codeph>CREATE EXTERNAL TABLE</codeph> + statement, or move data files under external control with the <codeph>LOAD DATA</codeph> statement, + Impala preserves the original filenames. + </li> + </ul> + + <p> + Filenames for data produced through Impala <codeph>INSERT</codeph> statements are given unique names to + avoid filename conflicts. + </p> + + <p> + An <codeph>INSERT ... 
SELECT</codeph> statement produces one data file from each node that processes the + <codeph>SELECT</codeph> part of the statement. An <codeph>INSERT ... VALUES</codeph> statement produces a + separate data file for each statement; because Impala is more efficient querying a small number of huge + files than a large number of tiny files, the <codeph>INSERT ... VALUES</codeph> syntax is not recommended + for loading a substantial volume of data. If you find yourself with a table that is inefficient due to too + many small data files, reorganize the data into a few large files by doing <codeph>INSERT ... + SELECT</codeph> to transfer the data to a new table. + </p> + + <p> + <b>Special values within text data files:</b> + </p> + + <ul> + <li rev="1.4.0"> + <p> + Impala recognizes the literal strings <codeph>inf</codeph> for infinity and <codeph>nan</codeph> for + <q>Not a Number</q>, for <codeph>FLOAT</codeph> and <codeph>DOUBLE</codeph> columns. + </p> + </li> + + <li> + <p> + Impala recognizes the literal string <codeph>\N</codeph> to represent <codeph>NULL</codeph>. When using + Sqoop, specify the options <codeph>--null-non-string</codeph> and <codeph>--null-string</codeph> to + ensure all <codeph>NULL</codeph> values are represented correctly in the Sqoop output files. By default, + Sqoop writes <codeph>NULL</codeph> values using the string <codeph>null</codeph>, which causes a + conversion error when such rows are evaluated by Impala. (A workaround for existing tables and data files + is to change the table properties through <codeph>ALTER TABLE <varname>name</varname> SET + TBLPROPERTIES("serialization.null.format"="null")</codeph>.) + </p> + </li> + + <li> + <p conref="../shared/impala_common.xml#common/skip_header_lines"/> + </li> + </ul> + + </conbody> + + </concept> + + <concept id="text_etl"> + + <title>Loading Data into Impala Text Tables</title> + <prolog> + <metadata> + <data name="Category" value="ETL"/> + <data name="Category" value="Ingest"/> + </metadata> + </prolog> + + <conbody> + + <p> + To load an existing text file into an Impala text table, use the <codeph>LOAD DATA</codeph> statement and + specify the path of the file in HDFS. That file is moved into the appropriate Impala data directory. + </p> + + <p> + To load multiple existing text files into an Impala text table, use the <codeph>LOAD DATA</codeph> + statement and specify the HDFS path of the directory containing the files. All non-hidden files are moved + into the appropriate Impala data directory. + </p> + + <p> + To convert data to text from any other file format supported by Impala, use a SQL statement such as: + </p> + +<codeblock>-- Text table with default delimiter, the hex 01 character. +CREATE TABLE text_table AS SELECT * FROM other_file_format_table; + +-- Text table with user-specified delimiter. Currently, you cannot specify +-- the delimiter as part of CREATE TABLE LIKE or CREATE TABLE AS SELECT. +-- But you can change an existing text table to have a different delimiter. +CREATE TABLE csv LIKE other_file_format_table; +ALTER TABLE csv SET SERDEPROPERTIES ('serialization.format'=',', 'field.delim'=','); +INSERT INTO csv SELECT * FROM other_file_format_table;</codeblock> + + <p> + This can be a useful technique to see how Impala represents special values within a text-format data file. 
+      Use the <codeph>DESCRIBE FORMATTED</codeph> statement to see the HDFS directory where the data files are
+      stored, then use Linux commands such as <codeph>hdfs dfs -ls <varname>hdfs_directory</varname></codeph> and
+      <codeph>hdfs dfs -cat <varname>hdfs_file</varname></codeph> to display the contents of an Impala-created
+      text file.
+    </p>
+
+    <p>
+      To create a few rows in a text table for test purposes, you can use the <codeph>INSERT ... VALUES</codeph>
+      syntax:
+    </p>
+
+<codeblock>INSERT INTO <varname>text_table</varname> VALUES ('string_literal',100,hex('hello world'));</codeblock>
+
+    <note>
+      Because Impala and the HDFS infrastructure are optimized for multi-megabyte files, avoid the <codeph>INSERT
+      ... VALUES</codeph> notation when you are inserting many rows. Each <codeph>INSERT ... VALUES</codeph>
+      statement produces a new tiny file, leading to fragmentation and reduced performance. When creating any
+      substantial volume of new data, use one of the bulk loading techniques such as <codeph>LOAD DATA</codeph>
+      or <codeph>INSERT ... SELECT</codeph>. Or, <xref href="impala_hbase.xml#impala_hbase">use an HBase
+      table</xref> for single-row <codeph>INSERT</codeph> operations, because HBase tables are not subject to the
+      same fragmentation issues as tables stored on HDFS.
+    </note>
+
+    <p>
+      When you create a text file for use with an Impala text table, specify <codeph>\N</codeph> to represent a
+      <codeph>NULL</codeph> value. For the differences between <codeph>NULL</codeph> and empty strings, see
+      <xref href="impala_literals.xml#null"/>.
+    </p>
+
+    <p>
+      If a text file has fewer fields than the columns in the corresponding Impala table, the columns that have
+      no corresponding fields are set to <codeph>NULL</codeph> when the data in that file is read by an Impala
+      query.
+    </p>
+
+    <p>
+      If a text file has more fields than the columns in the corresponding Impala table, the extra fields are
+      ignored when the data in that file is read by an Impala query.
+    </p>
+
+    <p>
+      You can also use manual HDFS operations such as <codeph>hdfs dfs -put</codeph> or <codeph>hdfs dfs
+      -cp</codeph> to put data files in the data directory for an Impala table. When you copy or move new data
+      files into the HDFS directory for the Impala table, issue a <codeph>REFRESH
+      <varname>table_name</varname></codeph> statement in <cmdname>impala-shell</cmdname> before issuing the next
+      query against that table, to make Impala recognize the newly added files.
+    </p>
+
+  </conbody>
+
+  </concept>
+
+  <concept id="lzo">
+
+    <title>Using LZO-Compressed Text Files</title>
+  <prolog>
+    <metadata>
+      <data name="Category" value="LZO"/>
+      <data name="Category" value="Compression"/>
+    </metadata>
+  </prolog>
+
+    <conbody>
+
+    <p>
+      <indexterm audience="Cloudera">LZO support in Impala</indexterm>
+
+      <indexterm audience="Cloudera">compression</indexterm>
+      Impala supports using text data files that employ LZO compression. Cloudera recommends compressing
+      text data files when practical. Impala queries are usually I/O-bound; reducing the amount of data read from
+      disk typically speeds up a query, despite the extra CPU work to uncompress the data in memory.
+    </p>
+
+    <p>
+      LZO-compressed text files are preferable to text files compressed by other codecs, because
+      LZO-compressed files are <q>splittable</q>, meaning that different portions of a file can be uncompressed
+      and processed independently by different nodes.
+    </p>
+
+    <p>
+      Impala does not currently support writing LZO-compressed text files.
+ </p> + + <p> + Because Impala can query LZO-compressed files but currently cannot write them, you use Hive to do the + initial <codeph>CREATE TABLE</codeph> and load the data, then switch back to Impala to run queries. For + instructions on setting up LZO compression for Hive <codeph>CREATE TABLE</codeph> and + <codeph>INSERT</codeph> statements, see + <xref href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO" scope="external" format="html">the + LZO page on the Hive wiki</xref>. Once you have created an LZO text table, you can also manually add + LZO-compressed text files to it, produced by the + <xref href="http://www.lzop.org/" scope="external" format="html"> <cmdname>lzop</cmdname></xref> command + or similar method. + </p> + + <section id="lzo_setup"> + + <title>Preparing to Use LZO-Compressed Text Files</title> + + <p> + Before using LZO-compressed tables in Impala, do the following one-time setup for each machine in the + cluster. Install the necessary packages using either the Cloudera public repository, a private repository + you establish, or by using packages. You must do these steps manually, whether or not the cluster is + managed by the Cloudera Manager product. + </p> + + <ol> + <li> + <b>Prepare your systems to work with LZO using Cloudera repositories:</b> + <p> + <b>On systems managed by Cloudera Manager using parcels:</b> + </p> + + <p> + See the setup instructions for the LZO parcel in the Cloudera Manager documentation for + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_gpl_extras.html" scope="external" format="html">Cloudera + Manager 5</xref>. + </p> + + <p> + <b>On systems managed by Cloudera Manager using packages, or not managed by Cloudera Manager:</b> + </p> + + <p> + Download and install the appropriate file to each machine on which you intend to use LZO with Impala. + These files all come from the Cloudera + <xref href="https://archive.cloudera.com/gplextras/redhat/5/x86_64/gplextras/" scope="external" format="html">GPL + extras</xref> download site. Install the: + </p> + <ul> + <li> + <xref href="https://archive.cloudera.com/gplextras/redhat/5/x86_64/gplextras/cloudera-gplextras4.repo" scope="external" format="repo">Red + Hat 5 repo file</xref> to <filepath>/etc/yum.repos.d/</filepath>. + </li> + + <li> + <xref href="https://archive.cloudera.com/gplextras/redhat/6/x86_64/gplextras/cloudera-gplextras4.repo" scope="external" format="repo">Red + Hat 6 repo file</xref> to <filepath>/etc/yum.repos.d/</filepath>. + </li> + + <li> + <xref href="https://archive.cloudera.com/gplextras/sles/11/x86_64/gplextras/cloudera-gplextras4.repo" scope="external" format="repo">SUSE + repo file</xref> to <filepath>/etc/zypp/repos.d/</filepath>. + </li> + + <li> + <xref href="https://archive.cloudera.com/gplextras/ubuntu/lucid/amd64/gplextras/cloudera.list" scope="external" format="list">Ubuntu + 10.04 list file</xref> to <filepath>/etc/apt/sources.list.d/</filepath>. + </li> + + <li> + <xref href="https://archive.cloudera.com/gplextras/ubuntu/precise/amd64/gplextras/cloudera.list" scope="external" format="list">Ubuntu + 12.04 list file</xref> to <filepath>/etc/apt/sources.list.d/</filepath>. + </li> + + <li> + <xref href="https://archive.cloudera.com/gplextras/debian/squeeze/amd64/gplextras/cloudera.list" scope="external" format="list">Debian + list file</xref> to <filepath>/etc/apt/sources.list.d/</filepath>. 
</li>
+            </ul>
+          </li>
+
+          <li>
+            <b>Configure Impala to use LZO:</b>
+            <p>
+              Use <b>one</b> of the following sets of commands to refresh your package management system's
+              repository information, install the base LZO support for Hadoop, and install the LZO support for
+              Impala.
+            </p>
+
+            <note rev="1.2.0">
+              <p rev="1.2.0">
+                The name of the Hadoop LZO package changed between CDH 4 and CDH 5. In CDH 4, the package name was
+                <codeph>hadoop-lzo-cdh4</codeph>. In CDH 5 and higher, the package name is <codeph>hadoop-lzo</codeph>.
+              </p>
+            </note>
+
+            <p>
+              <b>For RHEL/CentOS systems:</b>
+            </p>
+<codeblock>$ sudo yum update
+$ sudo yum install hadoop-lzo
+$ sudo yum install impala-lzo</codeblock>
+            <p>
+              <b>For SUSE systems:</b>
+            </p>
+<codeblock rev="1.2">$ sudo zypper update
+$ sudo zypper install hadoop-lzo
+$ sudo zypper install impala-lzo</codeblock>
+            <p>
+              <b>For Debian/Ubuntu systems:</b>
+            </p>
+<codeblock>$ sudo apt-get update
+$ sudo apt-get install hadoop-lzo
+$ sudo apt-get install impala-lzo</codeblock>
+            <note>
+              <p>
+                The level of the <codeph>impala-lzo</codeph> package is closely tied to the version of Impala
+                you use. Any time you upgrade Impala, re-do the installation command for
+                <codeph>impala-lzo</codeph> on each applicable machine to make sure you have the appropriate
+                version of that package.
+              </p>
+            </note>
+          </li>
+
+          <li>
+            For <codeph>core-site.xml</codeph> on the client <b>and</b> server (that is, in the configuration
+            directories for both Impala and Hadoop), append <codeph>com.hadoop.compression.lzo.LzopCodec</codeph>
+            to the comma-separated list of codecs. For example:
+<codeblock><property>
+  <name>io.compression.codecs</name>
+  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,
+org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,
+org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec</value>
+</property></codeblock>
+            <note>
+              <p>
+                If this is the first time you have edited the Hadoop <filepath>core-site.xml</filepath> file, note
+                that the <filepath>/etc/hadoop/conf</filepath> directory is typically a symbolic link, so the
+                canonical <filepath>core-site.xml</filepath> might reside in a different directory:
+              </p>
+<codeblock>$ ls -l /etc/hadoop
+total 8
+lrwxrwxrwx. 1 root root   29 Feb 26  2013 conf -> /etc/alternatives/hadoop-conf
+lrwxrwxrwx. 1 root root   10 Feb 26  2013 conf.dist -> conf.empty
+drwxr-xr-x. 2 root root 4096 Feb 26  2013 conf.empty
+drwxr-xr-x. 2 root root 4096 Oct 28 15:46 conf.pseudo</codeblock>
+              <p>
+                If the <codeph>io.compression.codecs</codeph> property is missing from
+                <filepath>core-site.xml</filepath>, only add <codeph>com.hadoop.compression.lzo.LzopCodec</codeph>
+                to the new property value, not all the names from the preceding example.
+              </p>
+            </note>
+          </li>
+
+          <li>
+            <!-- To do:
+            Link to CM or other doc where that procedure is explained.
+            Run through the procedure in CM and cite the relevant safety valves to put the XML into.
+            -->
+            Restart the MapReduce and Impala services.
+ </li> + </ol> + + </section> + + <section id="lzo_create_table"> + + <title>Creating LZO Compressed Text Tables</title> + + <p> + A table containing LZO-compressed text files must be created in Hive with the following storage clause: + </p> + +<codeblock>STORED AS + INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' + OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'</codeblock> + +<!-- + <p> + In Hive, when writing LZO compressed text tables, you must include the following specification: + </p> + +<codeblock>hive> SET hive.exec.compress.output=true; +hive> SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;</codeblock> +--> + + <p> + Also, certain Hive settings need to be in effect. For example: + </p> + +<codeblock>hive> SET mapreduce.output.fileoutputformat.compress=true; +hive> SET hive.exec.compress.output=true; +hive> SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec; +hive> CREATE TABLE lzo_t (s string) STORED AS + > INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' + > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'; +hive> INSERT INTO TABLE lzo_t SELECT col1, col2 FROM uncompressed_text_table;</codeblock> + + <p> + Once you have created LZO-compressed text tables, you can convert data stored in other tables (regardless + of file format) by using the <codeph>INSERT ... SELECT</codeph> statement in Hive. + </p> + + <p> + Files in an LZO-compressed table must use the <codeph>.lzo</codeph> extension. Examine the files in the + HDFS data directory after doing the <codeph>INSERT</codeph> in Hive, to make sure the files have the + right extension. If the required settings are not in place, you end up with regular uncompressed files, + and Impala cannot access the table because it finds data files with the wrong (uncompressed) format. + </p> + + <p> + After loading data into an LZO-compressed text table, index the files so that they can be split. You + index the files by running a Java class, + <codeph>com.hadoop.compression.lzo.DistributedLzoIndexer</codeph>, through the Linux command line. This + Java class is included in the <codeph>hadoop-lzo</codeph> package. + </p> + + <p> + Run the indexer using a command like the following: + </p> + +<codeblock>$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar + com.hadoop.compression.lzo.DistributedLzoIndexer /hdfs_location_of_table/</codeblock> + + <note> + If the path of the JAR file in the preceding example is not recognized, do a <cmdname>find</cmdname> + command to locate <filepath>hadoop-lzo-*-gplextras.jar</filepath> and use that path. + </note> + + <p> + Indexed files have the same name as the file they index, with the <codeph>.index</codeph> extension. If + the data files are not indexed, Impala queries still work, but the queries read the data from remote + DataNodes, which is very inefficient. + </p> + + <!-- To do: + Here is the place to put some end-to-end examples once I have it + all working. Or at least the final step with Impala queries. + Have never actually gotten this part working yet due to mismatches + between the levels of Impala and LZO packages. + --> + + <p> + Once the LZO-compressed tables are created, and data is loaded and indexed, you can query them through + Impala. As always, the first time you start <cmdname>impala-shell</cmdname> after creating a table in + Hive, issue an <codeph>INVALIDATE METADATA</codeph> statement so that Impala recognizes the new table. 
+ (In Impala 1.2 and higher, you only have to run <codeph>INVALIDATE METADATA</codeph> on one node, rather + than on all the Impala nodes.) + </p> + + </section> + + </conbody> + + </concept> + + <concept rev="2.0.0" id="gzip"> + + <title>Using gzip, bzip2, or Snappy-Compressed Text Files</title> + <prolog> + <metadata> + <data name="Category" value="Snappy"/> + <data name="Category" value="Gzip"/> + <data name="Category" value="Compression"/> + </metadata> + </prolog> + + <conbody> + + <p> + <indexterm audience="Cloudera">gzip support in Impala</indexterm> + + <indexterm audience="Cloudera">bzip2 support in Impala</indexterm> + + <indexterm audience="Cloudera">Snappy support in Impala</indexterm> + + <indexterm audience="Cloudera">compression</indexterm> + In Impala 2.0 and later, Impala supports using text data files that employ gzip, bzip2, or Snappy + compression. These compression types are primarily for convenience within an existing ETL pipeline rather + than maximum performance. Although it requires less I/O to read compressed text than the equivalent + uncompressed text, files compressed by these codecs are not <q>splittable</q> and therefore cannot take + full advantage of the Impala parallel query capability. + </p> + + <p> + As each bzip2- or Snappy-compressed text file is processed, the node doing the work reads the entire file + into memory and then decompresses it. Therefore, the node must have enough memory to hold both the + compressed and uncompressed data from the text file. The memory required to hold the uncompressed data is + difficult to estimate in advance, potentially causing problems on systems with low memory limits or with + resource management enabled. <ph rev="2.1.0">In Impala 2.1 and higher, this memory overhead is reduced for + gzip-compressed text files. The gzipped data is decompressed as it is read, rather than all at once.</ph> + </p> + +<!-- + <p> + Impala can work with LZO-compressed text files but not GZip-compressed text. + LZO-compressed files are <q>splittable</q>, meaning that different portions of a file + can be uncompressed and processed independently by different nodes. GZip-compressed + files are not splittable, making them unsuitable for Impala-style distributed queries. + </p> +--> + + <p> + To create a table to hold gzip, bzip2, or Snappy-compressed text, create a text table with no special + compression options. Specify the delimiter and escape character if required, using the <codeph>ROW + FORMAT</codeph> clause. + </p> + + <p> + Because Impala can query compressed text files but currently cannot write them, produce the compressed text + files outside Impala and use the <codeph>LOAD DATA</codeph> statement, manual HDFS commands to move them to + the appropriate Impala data directory. (Or, you can use <codeph>CREATE EXTERNAL TABLE</codeph> and point + the <codeph>LOCATION</codeph> attribute at a directory containing existing compressed text files.) + </p> + + <p> + For Impala to recognize the compressed text files, they must have the appropriate file extension + corresponding to the compression codec, either <codeph>.gz</codeph>, <codeph>.bz2</codeph>, or + <codeph>.snappy</codeph>. The extensions can be in uppercase or lowercase. 
+ </p> + + <p> + The following example shows how you can create a regular text table, put different kinds of compressed and + uncompressed files into it, and Impala automatically recognizes and decompresses each one based on their + file extensions: + </p> + +<codeblock>create table csv_compressed (a string, b string, c string) + row format delimited fields terminated by ","; + +insert into csv_compressed values + ('one - uncompressed', 'two - uncompressed', 'three - uncompressed'), + ('abc - uncompressed', 'xyz - uncompressed', '123 - uncompressed'); +...make equivalent .gz, .bz2, and .snappy files and load them into same table directory... + +select * from csv_compressed; ++--------------------+--------------------+----------------------+ +| a | b | c | ++--------------------+--------------------+----------------------+ +| one - snappy | two - snappy | three - snappy | +| one - uncompressed | two - uncompressed | three - uncompressed | +| abc - uncompressed | xyz - uncompressed | 123 - uncompressed | +| one - bz2 | two - bz2 | three - bz2 | +| abc - bz2 | xyz - bz2 | 123 - bz2 | +| one - gzip | two - gzip | three - gzip | +| abc - gzip | xyz - gzip | 123 - gzip | ++--------------------+--------------------+----------------------+ + +$ hdfs dfs -ls 'hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/'; +...truncated for readability... +75 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed.snappy +79 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed_bz2.csv.bz2 +80 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed_gzip.csv.gz +116 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/dd414df64d67d49b_data.0. +</codeblock> </conbody> </concept> + <concept audience="Cloudera" id="txtfile_data_types"> + + <title>Data Type Considerations for Text Tables</title> + + <conbody> + + <p></p> + + </conbody> + + </concept> +</concept>
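  <p>
    As a minimal sketch of the text-to-Parquet conversion described in the <xref
    href="impala_txtfile.xml#text_performance"/> section above (the table names and HDFS path here are
    hypothetical placeholders, not part of the original topic):
  </p>
<codeblock>-- Point an external text table at CSV data files that already exist in HDFS.
CREATE EXTERNAL TABLE csv_original (id INT, s STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/etl/csv_original';

-- Copy the data into a Parquet table for the performance-critical queries;
-- the data is converted to Parquet automatically during the INSERT.
CREATE TABLE csv_parquet LIKE csv_original STORED AS PARQUET;
INSERT INTO csv_parquet SELECT * FROM csv_original;</codeblock>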
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_udf.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_udf.xml b/docs/topics/impala_udf.xml index 53dd8eb..2d2f3b5 100644 --- a/docs/topics/impala_udf.xml +++ b/docs/topics/impala_udf.xml @@ -8,6 +8,8 @@ <data name="Category" value="Impala"/> <data name="Category" value="Impala Functions"/> <data name="Category" value="UDFs"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> @@ -169,9 +171,11 @@ select real_words(letters) from word_games;</codeblock> </li> <li> - The return type must be a <q>Writable</q> type such as <codeph>Text</codeph> or + Prior to CDH 5.7 / Impala 2.5, the return type must be a <q>Writable</q> type such as <codeph>Text</codeph> or <codeph>IntWritable</codeph>, rather than a Java primitive type such as <codeph>String</codeph> or - <codeph>int</codeph>. Otherwise, the UDF will return <codeph>NULL</codeph>. + <codeph>int</codeph>. Otherwise, the UDF returns <codeph>NULL</codeph>. + <ph rev="2.5.0">In CDH 5.7 / Impala 2.5 and higher, this restriction is lifted, and both + UDF arguments and return values can be Java primitive types.</ph> </li> <li> @@ -182,6 +186,12 @@ select real_words(letters) from word_games;</codeblock> Typically, a Java UDF will execute several times slower in Impala than the equivalent native UDF written in C++. </li> + <li rev="2.5.0 IMPALA-2843 CDH-39148"> + In CDH 5.7 / Impala 2.5 and higher, you can transparently call Hive Java UDFs through Impala, + or call Impala Java UDFs through Hive. This feature does not apply to built-in Hive functions. + Any Impala Java UDFs created with older versions must be re-created using new <codeph>CREATE FUNCTION</codeph> + syntax, without any signature for arguments or the return value. + </li> </ul> <p> @@ -254,6 +264,7 @@ select real_words(letters) from word_games;</codeblock> <codeph>WHERE</codeph> clause), directly on a column, and on the results of a string expression: </p> +<!-- To do: adapt for signatureless syntax per CDH-39148 / IMPALA-2843. --> <codeblock>[localhost:21000] > create database udfs; [localhost:21000] > use udfs; localhost:21000] > create function lower(string) returns string location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFLower'; @@ -385,8 +396,9 @@ and other examples demonstrating this technique in <conbody> - <p> - To develop UDFs for Impala, download and install the <codeph>impala-udf-devel</codeph> package containing + <p rev="CDH-37080"> + To develop UDFs for Impala, download and install the <codeph>impala-udf-devel</codeph> package (RHEL-based + distributions) or <codeph>impala-udf-dev</codeph> (Ubuntu and Debian). This package contains header files, sample source, and build configuration files. </p> @@ -403,9 +415,10 @@ and other examples demonstrating this technique in <codeph>.repo</codeph> file for CDH 4 on RHEL 6</xref>. </li> - <li> + <li rev="CDH-37080"> Use the familiar <codeph>yum</codeph>, <codeph>zypper</codeph>, or <codeph>apt-get</codeph> commands - depending on your operating system, with <codeph>impala-udf-devel</codeph> for the package name. + depending on your operating system. For the package name, specify <codeph>impala-udf-devel</codeph> + (RHEL-based distributions) or <codeph>impala-udf-dev</codeph> (Ubuntu and Debian). 
</li> </ol> @@ -480,10 +493,12 @@ and other examples demonstrating this technique in <p> For the basic declarations needed to write a scalar UDF, see the header file - <filepath>udf-sample.h</filepath> within the sample build environment, which defines a simple function + <xref href="https://github.com/cloudera/impala-udf-samples/blob/master/udf-sample.h" scope="external" format="html"><filepath>udf-sample.h</filepath></xref> + within the sample build environment, which defines a simple function named <codeph>AddUdf()</codeph>: </p> +<!-- Downloadable version of this file: https://raw.githubusercontent.com/cloudera/impala-udf-samples/master/udf-sample.h --> <codeblock>#ifndef IMPALA_UDF_SAMPLE_UDF_H #define IMPALA_UDF_SAMPLE_UDF_H @@ -493,13 +508,15 @@ using namespace impala_udf; IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2); -#endif</codeblock> +#endif +</codeblock> <p> For sample C++ code for a simple function named <codeph>AddUdf()</codeph>, see the source file <filepath>udf-sample.cc</filepath> within the sample build environment: </p> +<!-- Downloadable version of this file: https://raw.githubusercontent.com/cloudera/impala-udf-samples/master/udf-sample.cc --> <codeblock>#include "udf-sample.h" // In this sample we are declaring a UDF that adds two ints and returns an int. @@ -522,7 +539,7 @@ IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& Each value that a user-defined function can accept as an argument or return as a result value must map to a SQL data type that you could specify for a table column. </p> - + <p conref="../shared/impala_common.xml#common/udfs_no_complex_types"/> <p> @@ -921,10 +938,10 @@ within UDAs, you can return without specifying a value. </p> <p> - <draft-comment translate="no"> -Need an example to demonstrate exactly what tokens are used for init, merge, finalize in -this substitution. -</draft-comment> + <!-- To do: + Need an example to demonstrate exactly what tokens are used for init, merge, finalize in + this substitution. + --> For convenience, you can use a naming convention for the underlying functions and Impala automatically recognizes those entry points. Specify the <codeph>UPDATE_FN</codeph> clause, using an entry point name containing the string <codeph>update</codeph> or <codeph>Update</codeph>. When you omit the other @@ -943,56 +960,134 @@ this substitution. <filepath>uda-sample.h</filepath>: </p> - <p> - See this file online at: - <xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" scope="external" format="html"/> - </p> + <p> See this file online at: <xref + href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.h" + scope="external" format="html" /></p> -<codeblock audience="Cloudera">#ifndef IMPALA_UDF_SAMPLE_UDA_H -#define IMPALA_UDF_SAMPLE_UDA_H +<codeblock audience="Cloudera">#ifndef SAMPLES_UDA_H +#define SAMPLES_UDA_H #include <impala_udf/udf.h> using namespace impala_udf; // This is an example of the COUNT aggregate function. 
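+// A UDA supplies Init, Update, Merge, and Finalize entry points: Init sets up the intermediate
+// value, Update folds in one input row at a time, Merge combines partial results from different
+// nodes, and Finalize produces the final return value.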
+// +// Usage: > create aggregate function my_count(int) returns bigint +// location '/user/cloudera/libudasample.so' update_fn='CountUpdate'; +// > select my_count(col) from tbl; + void CountInit(FunctionContext* context, BigIntVal* val); -void CountUpdate(FunctionContext* context, const AnyVal& input, BigIntVal* val); +void CountUpdate(FunctionContext* context, const IntVal& input, BigIntVal* val); void CountMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst); BigIntVal CountFinalize(FunctionContext* context, const BigIntVal& val); + // This is an example of the AVG(double) aggregate function. This function needs to // maintain two pieces of state, the current sum and the count. We do this using -// the BufferVal intermediate type. When this UDA is registered, it would specify +// the StringVal intermediate type. When this UDA is registered, it would specify // 16 bytes (8 byte sum + 8 byte count) as the size for this buffer. -void AvgInit(FunctionContext* context, BufferVal* val); -void AvgUpdate(FunctionContext* context, const DoubleVal& input, BufferVal* val); -void AvgMerge(FunctionContext* context, const BufferVal& src, BufferVal* dst); -DoubleVal AvgFinalize(FunctionContext* context, const BufferVal& val); +// +// Usage: > create aggregate function my_avg(double) returns string +// location '/user/cloudera/libudasample.so' update_fn='AvgUpdate'; +// > select cast(my_avg(col) as double) from tbl; + +void AvgInit(FunctionContext* context, StringVal* val); +void AvgUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val); +void AvgMerge(FunctionContext* context, const StringVal& src, StringVal* dst); +const StringVal AvgSerialize(FunctionContext* context, const StringVal& val); +StringVal AvgFinalize(FunctionContext* context, const StringVal& val); + // This is a sample of implementing the STRING_CONCAT aggregate function. -// Example: select string_concat(string_col, ",") from table +// +// Usage: > create aggregate function string_concat(string, string) returns string +// location '/user/cloudera/libudasample.so' update_fn='StringConcatUpdate'; +// > select string_concat(string_col, ",") from table; + void StringConcatInit(FunctionContext* context, StringVal* val); void StringConcatUpdate(FunctionContext* context, const StringVal& arg1, const StringVal& arg2, StringVal* val); void StringConcatMerge(FunctionContext* context, const StringVal& src, StringVal* dst); +const StringVal StringConcatSerialize(FunctionContext* context, const StringVal& val); StringVal StringConcatFinalize(FunctionContext* context, const StringVal& val); + +// This is a example of the variance aggregate function. +// +// Usage: > create aggregate function var(double) returns string +// location '/user/cloudera/libudasample.so' update_fn='VarianceUpdate'; +// > select cast(var(col) as double) from tbl; + +void VarianceInit(FunctionContext* context, StringVal* val); +void VarianceUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val); +void VarianceMerge(FunctionContext* context, const StringVal& src, StringVal* dst); +const StringVal VarianceSerialize(FunctionContext* context, const StringVal& val); +StringVal VarianceFinalize(FunctionContext* context, const StringVal& val); + + +// An implementation of the Knuth online variance algorithm, which is also single pass and +// more numerically stable. 
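+// (The algorithm keeps a running count, mean, and sum of squared differences from the mean,
+// so the variance can be computed in a single pass without buffering the input values.)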
+// +// Usage: > create aggregate function knuth_var(double) returns string +// location '/user/cloudera/libudasample.so' update_fn='KnuthVarianceUpdate'; +// > select cast(knuth_var(col) as double) from tbl; + +void KnuthVarianceInit(FunctionContext* context, StringVal* val); +void KnuthVarianceUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val); +void KnuthVarianceMerge(FunctionContext* context, const StringVal& src, StringVal* dst); +const StringVal KnuthVarianceSerialize(FunctionContext* context, const StringVal& val); +StringVal KnuthVarianceFinalize(FunctionContext* context, const StringVal& val); + + +// The different steps of the UDA are composable. In this case, we'the UDA will use the +// other steps from the Knuth variance computation. +// +// Usage: > create aggregate function stddev(double) returns string +// location '/user/cloudera/libudasample.so' update_fn='KnuthVarianceUpdate' +// finalize_fn="StdDevFinalize"; +// > select cast(stddev(col) as double) from tbl; + +StringVal StdDevFinalize(FunctionContext* context, const StringVal& val); + + +// Utility function for serialization to StringVal +template <typename T> +StringVal ToStringVal(FunctionContext* context, const T& val); + #endif</codeblock> <p> <filepath>uda-sample.cc</filepath>: </p> - <p> - See this file online at: - <xref href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.h" scope="external" format="html"/> + <p> See this file online at: <xref + href="https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.cc" + scope="external" format="html" /> </p> <codeblock audience="Cloudera">#include "uda-sample.h" #include <assert.h> +#include <sstream> using namespace impala_udf; +using namespace std; + +template <typename T> +StringVal ToStringVal(FunctionContext* context, const T& val) { + stringstream ss; + ss << val; + string str = ss.str(); + StringVal string_val(context, str.size()); + memcpy(string_val.ptr, str.c_str(), str.size()); + return string_val; +} + +template <> +StringVal ToStringVal<DoubleVal>(FunctionContext* context, const DoubleVal& val) { + if (val.is_null) return StringVal::null(); + return ToStringVal(context, val.val); +} // --------------------------------------------------------------------------- // This is a sample of implementing a COUNT aggregate function. @@ -1002,7 +1097,7 @@ void CountInit(FunctionContext* context, BigIntVal* val) { val->val = 0; } -void CountUpdate(FunctionContext* context, const AnyVal& input, BigIntVal* val) { +void CountUpdate(FunctionContext* context, const IntVal& input, BigIntVal* val) { if (input.is_null) return; ++val->val; } @@ -1016,61 +1111,99 @@ BigIntVal CountFinalize(FunctionContext* context, const BigIntVal& val) { } // --------------------------------------------------------------------------- -// This is a sample of implementing an AVG aggregate function. +// This is a sample of implementing a AVG aggregate function. 
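+// The intermediate state is an AvgStruct (8-byte sum plus 8-byte count) carried in the bytes
+// of a StringVal buffer; AvgFinalize() divides the accumulated sum by the count.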
// --------------------------------------------------------------------------- struct AvgStruct { double sum; int64_t count; }; -void AvgInit(FunctionContext* context, BufferVal* val) { - assert(sizeof(AvgStruct) == 16); - memset(*val, 0, sizeof(AvgStruct)); +// Initialize the StringVal intermediate to a zero'd AvgStruct +void AvgInit(FunctionContext* context, StringVal* val) { + val->is_null = false; + val->len = sizeof(AvgStruct); + val->ptr = context->Allocate(val->len); + memset(val->ptr, 0, val->len); } -void AvgUpdate(FunctionContext* context, const DoubleVal& input, BufferVal* val) { +void AvgUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val) { if (input.is_null) return; - AvgStruct* avg = reinterpret_cast<AvgStruct*>(*val); + assert(!val->is_null); + assert(val->len == sizeof(AvgStruct)); + AvgStruct* avg = reinterpret_cast<AvgStruct*>(val->ptr); avg->sum += input.val; ++avg->count; } -void AvgMerge(FunctionContext* context, const BufferVal& src, BufferVal* dst) { - if (src == NULL) return; - const AvgStruct* src_struct = reinterpret_cast<const AvgStruct*>(src); - AvgStruct* dst_struct = reinterpret_cast<AvgStruct*>(*dst); - dst_struct->sum += src_struct->sum; - dst_struct->count += src_struct->count; +void AvgMerge(FunctionContext* context, const StringVal& src, StringVal* dst) { + if (src.is_null) return; + const AvgStruct* src_avg = reinterpret_cast<const AvgStruct*>(src.ptr); + AvgStruct* dst_avg = reinterpret_cast<AvgStruct*>(dst->ptr); + dst_avg->sum += src_avg->sum; + dst_avg->count += src_avg->count; } -DoubleVal AvgFinalize(FunctionContext* context, const BufferVal& val) { - if (val == NULL) return DoubleVal::null(); - AvgStruct* val_struct = reinterpret_cast<AvgStruct*>(val); - return DoubleVal(val_struct->sum / val_struct->count); +// A serialize function is necesary to free the intermediate state allocation. We use the +// StringVal constructor to allocate memory owned by Impala, copy the intermediate state, +// and free the original allocation. Note that memory allocated by the StringVal ctor is +// not necessarily persisted across UDA function calls, which is why we don't use it in +// AvgInit(). +const StringVal AvgSerialize(FunctionContext* context, const StringVal& val) { + assert(!val.is_null); + StringVal result(context, val.len); + memcpy(result.ptr, val.ptr, val.len); + context->Free(val.ptr); + return result; +} + +StringVal AvgFinalize(FunctionContext* context, const StringVal& val) { + assert(!val.is_null); + assert(val.len == sizeof(AvgStruct)); + AvgStruct* avg = reinterpret_cast<AvgStruct*>(val.ptr); + StringVal result; + if (avg->count == 0) { + result = StringVal::null(); + } else { + // Copies the result to memory owned by Impala + result = ToStringVal(context, avg->sum / avg->count); + } + context->Free(val.ptr); + return result; } // --------------------------------------------------------------------------- // This is a sample of implementing the STRING_CONCAT aggregate function. // Example: select string_concat(string_col, ",") from table // --------------------------------------------------------------------------- +// Delimiter to use if the separator is NULL. 
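+// (StringConcatUpdate() below substitutes this ", " value whenever the separator argument is NULL.)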
+static const StringVal DEFAULT_STRING_CONCAT_DELIM((uint8_t*)", ", 2); + void StringConcatInit(FunctionContext* context, StringVal* val) { val->is_null = true; } -void StringConcatUpdate(FunctionContext* context, const StringVal& arg1, - const StringVal& arg2, StringVal* val) { - if (val->is_null) { - val->is_null = false; - *val = StringVal(context, arg1.len); - memcpy(val->ptr, arg1.ptr, arg1.len); - } else { - int new_len = val->len + arg1.len + arg2.len; - StringVal new_val(context, new_len); - memcpy(new_val.ptr, val->ptr, val->len); - memcpy(new_val.ptr + val->len, arg2.ptr, arg2.len); - memcpy(new_val.ptr + val->len + arg2.len, arg1.ptr, arg1.len); - *val = new_val; +void StringConcatUpdate(FunctionContext* context, const StringVal& str, + const StringVal& separator, StringVal* result) { + if (str.is_null) return; + if (result->is_null) { + // This is the first string, simply set the result to be the value. + uint8_t* copy = context->Allocate(str.len); + memcpy(copy, str.ptr, str.len); + *result = StringVal(copy, str.len); + return; } + + const StringVal* sep_ptr = separator.is_null ? &DEFAULT_STRING_CONCAT_DELIM : + &separator; + + // We need to grow the result buffer and then append the new string and + // separator. + int new_size = result->len + sep_ptr->len + str.len; + result->ptr = context->Reallocate(result->ptr, new_size); + memcpy(result->ptr + result->len, sep_ptr->ptr, sep_ptr->len); + result->len += sep_ptr->len; + memcpy(result->ptr + result->len, str.ptr, str.len); + result->len += str.len; } void StringConcatMerge(FunctionContext* context, const StringVal& src, StringVal* dst) { @@ -1078,13 +1211,31 @@ void StringConcatMerge(FunctionContext* context, const StringVal& src, Strin StringConcatUpdate(context, src, ",", dst); } +// A serialize function is necesary to free the intermediate state allocation. We use the +// StringVal constructor to allocate memory owned by Impala, copy the intermediate +// StringVal, and free the intermediate's memory. Note that memory allocated by the +// StringVal ctor is not necessarily persisted across UDA function calls, which is why we +// don't use it in StringConcatUpdate(). +const StringVal StringConcatSerialize(FunctionContext* context, const StringVal& val) { + if (val.is_null) return val; + StringVal result(context, val.len); + memcpy(result.ptr, val.ptr, val.len); + context->Free(val.ptr); + return result; +} + +// Same as StringConcatSerialize(). StringVal StringConcatFinalize(FunctionContext* context, const StringVal& val) { - return val; + if (val.is_null) return val; + StringVal result(context, val.len); + memcpy(result.ptr, val.ptr, val.len); + context->Free(val.ptr); + return result; }</codeblock> </conbody> </concept> - <concept audience="Cloudera" id="udf_intermediate"> + <concept rev="2.3.0 IMPALA-1829 CDH-30572" id="udf_intermediate"> <title>Intermediate Results for UDAs</title> @@ -1105,6 +1256,16 @@ StringVal StringConcatFinalize(FunctionContext* context, const StringVal& va specify the type name as <codeph>CHAR(<varname>n</varname>)</codeph>, with <varname>n</varname> representing the number of bytes in the intermediate result buffer. </p> + + <p> + For an example of this technique, see the <codeph>trunc_sum()</codeph> aggregate function, which accumulates + intermediate results of type <codeph>DOUBLE</codeph> and returns <codeph>BIGINT</codeph> at the end. 
+ View + <xref href="https://github.com/cloudera/Impala/blob/cdh5-trunk/tests/query_test/test_udfs.py" scope="external" format="html">the <codeph>CREATE FUNCTION</codeph> statement</xref> + and + <xref href="http://github.com/Cloudera/Impala/blob/cdh5-trunk/be/src/testutil/test-udas.cc" scope="external" format="html">the implementation of the underlying TruncSum*() functions</xref> + on Github. + </p> </conbody> </concept> </concept> @@ -1157,15 +1318,21 @@ StringVal StringConcatFinalize(FunctionContext* context, const StringVal& va <note> <p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/> + <p> + See <xref href="impala_create_function.xml#create_function"/> and <xref href="impala_drop_function.xml#drop_function"/> + for the new syntax for the persistent Java UDFs. + </p> </note> <p> Prerequisites for the build environment are: </p> -<codeblock># Use the appropriate package installation command for your Linux distribution. +<codeblock rev="CDH-37080"># Use the appropriate package installation command for your Linux distribution. sudo yum install gcc-c++ cmake boost-devel -sudo yum install impala-udf-devel</codeblock> +sudo yum install impala-udf-devel +# The package name on Ubuntu and Debian is impala-udf-dev. +</codeblock> <p> Then, unpack the sample code in <filepath>udf_samples.tar.gz</filepath> and use that as a template to set @@ -1730,6 +1897,10 @@ Returned 2 row(s) in 0.43s</codeblock> </li> <li> + <p conref="../shared/impala_common.xml#common/current_user_caveat"/> + </li> + + <li> All Impala UDFs must be deterministic, that is, produce the same output each time when passed the same argument values. For example, an Impala UDF must not call functions such as <codeph>rand()</codeph> to produce different values for each invocation. It must not retrieve data from external sources, such as @@ -1740,9 +1911,12 @@ Returned 2 row(s) in 0.43s</codeblock> An Impala UDF must not spawn other threads or processes. </li> - <li> - When the <cmdname>catalogd</cmdname> process is restarted, all UDFs become undefined and must be - reloaded. + <li rev="2.5.0 IMPALA-2843"> + Prior to CDH 5.7 / Impala 2.5, when the <cmdname>catalogd</cmdname> process is restarted, + all UDFs become undefined and must be reloaded. In CDH 5.7 / Impala 2.5 and higher, this + limitation only applies to older Java UDFs. Re-create those UDFs using the new + <codeph>CREATE FUNCTION</codeph> syntax for Java UDFs, which excludes the function signature, + to remove the limitation entirely. 
</li> <li> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_union.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_union.xml b/docs/topics/impala_union.xml index 29a0b45..ff4529f 100644 --- a/docs/topics/impala_union.xml +++ b/docs/topics/impala_union.xml @@ -8,6 +8,8 @@ <data name="Category" value="Impala"/> <data name="Category" value="SQL"/> <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_update.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_update.xml b/docs/topics/impala_update.xml index 3b9e330..a083c48 100644 --- a/docs/topics/impala_update.xml +++ b/docs/topics/impala_update.xml @@ -2,8 +2,8 @@ <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> <concept id="update"> - <title>UPDATE Statement (CDH 5.5 and higher only)</title> - <titlealts><navtitle>UPDATE</navtitle></titlealts> + <title>UPDATE Statement (CDH 5.10 or higher only)</title> + <titlealts audience="PDF"><navtitle>UPDATE</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -12,6 +12,7 @@ <data name="Category" value="ETL"/> <data name="Category" value="Ingest"/> <data name="Category" value="DML"/> + <data name="Category" value="Developers"/> <data name="Category" value="Data Analysts"/> </metadata> </prolog> @@ -31,7 +32,7 @@ <codeblock> </codeblock> - <p rev="kudu" audience="impala_next"> + <p rev="kudu"> Normally, an <codeph>UPDATE</codeph> operation for a Kudu table fails if some partition key columns are not found, due to their being deleted or changed by a concurrent <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> operation. http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_upgrading.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_upgrading.xml b/docs/topics/impala_upgrading.xml index 6fef62e..0baaae6 100644 --- a/docs/topics/impala_upgrading.xml +++ b/docs/topics/impala_upgrading.xml @@ -3,7 +3,13 @@ <concept id="upgrading"> <title>Upgrading Impala</title> - + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Upgrading"/> + <data name="Category" value="Administrators"/> + </metadata> + </prolog> <conbody> @@ -12,7 +18,361 @@ tool to upgrade Impala to the latest version, and then restarting Impala services. </p> - + <note> + <ul> + <li> + Each version of CDH 5 has an associated version of Impala, When you upgrade from CDH 4 to CDH 5, you get + whichever version of Impala comes with the associated level of CDH. Depending on the version of Impala + you were running on CDH 4, this could install a lower level of Impala on CDH 5. For example, if you + upgrade to CDH 5.0 from CDH 4 plus Impala 1.4, the CDH 5.0 installation comes with Impala 1.3. Always + check the associated level of Impala before upgrading to a specific version of CDH 5. Where practical, + upgrade from CDH 4 to the latest CDH 5, which also has the latest Impala. + </li> + + <li rev="ver"> + When you upgrade Impala, also upgrade Cloudera Manager if necessary: + <ul> + <li> + Users running Impala on CDH 5 must upgrade to Cloudera Manager 5.0.0 or higher. 
+ </li> + + <li> + Users running Impala on CDH 4 must upgrade to Cloudera Manager 4.8 or higher. Cloudera Manager 4.8 + includes management support for the Impala catalog service, and is the minimum Cloudera Manager + version you can use. + </li> + + <li> + Cloudera Manager is continually updated with configuration settings for features introduced in the + latest Impala releases. + </li> + </ul> + </li> + + <li> + If you are upgrading from CDH 5 beta to CDH 5.0 production, make sure you are using the appropriate CDH 5 + repositories shown on the +<!-- Original URL: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH-Version-and-Packaging-Information/CDH-Version-and-Packaging-Information.html --> + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/rg_vd.html" scope="external" format="html">CDH + version and packaging</xref> page, then follow the procedures throughout the rest of this section. + </li> + + <li> + Every time you upgrade to a new major or minor Impala release, see + <xref href="impala_incompatible_changes.xml#incompatible_changes"/> in the <cite>Release Notes</cite> for + any changes needed in your source code, startup scripts, and so on. + </li> + + <li> + Also check <xref href="impala_known_issues.xml#known_issues"/> in the <cite>Release Notes</cite> for any + issues or limitations that require workarounds. + </li> + + </ul> + </note> + + <p outputclass="toc inpage"/> + </conbody> + + <concept id="upgrade_cm_parcels"> + + <title>Upgrading Impala through Cloudera Manager - Parcels</title> + <prolog> + <metadata> + <data name="Category" value="Cloudera Manager"/> + <data name="Category" value="Parcels"/> + </metadata> + </prolog> + + <conbody> + + <p> + Parcels are an alternative binary distribution format available in Cloudera Manager 4.5 and higher. + </p> + + <note type="important"> + In CDH 5, there is not a separate Impala parcel; Impala is part of the main CDH 5 parcel. Each level of CDH + 5 has a corresponding version of Impala, and you upgrade Impala by upgrading CDH. See the + <xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_upgrading_cdh.html" scope="external" format="html">CDH + 5 upgrade instructions</xref> and choose the instructions for parcels. The remainder of this section only covers parcel upgrades for + Impala under CDH 4. + </note> + + <p> + To upgrade Impala for CDH 4 in a Cloudera Managed environment, using parcels: + </p> + + <ol> + <li> + <p> + If you originally installed using packages and now are switching to parcels, remove all the + Impala-related packages first. You can check which packages are installed using one of the following + commands, depending on your operating system: + </p> +<codeblock>rpm -qa # RHEL, Oracle Linux, CentOS, Debian +dpkg --get-selections # Debian</codeblock> + and then remove the packages using one of the following commands: +<codeblock>sudo yum remove <varname>pkg_names</varname> # RHEL, Oracle Linux, CentOS +sudo zypper remove <varname>pkg_names</varname> # SLES +sudo apt-get purge <varname>pkg_names</varname> # Ubuntu, Debian</codeblock> + </li> + + <li> + <p> + Connect to the Cloudera Manager Admin Console. + </p> + </li> + + <li> + <p> + Go to the <menucascade><uicontrol>Hosts</uicontrol><uicontrol>Parcels</uicontrol></menucascade> tab. + You should see a parcel with a newer version of Impala that you can upgrade to. + </p> + </li> + + <li> + <p> + Click <uicontrol>Download</uicontrol>, then <uicontrol>Distribute</uicontrol>. 
(The button changes as + each step completes.) + </p> + </li> + + <li> + <p> + Click <uicontrol>Activate</uicontrol>. + </p> + </li> + + <li> + <p> + When prompted, click <uicontrol>Restart</uicontrol> to restart the Impala service. + </p> + </li> + </ol> + </conbody> + </concept> + + <concept id="upgrade_cm_pkgs"> + + <title>Upgrading Impala through Cloudera Manager - Packages</title> + <prolog> + <metadata> + <data name="Category" value="Packages"/> + <data name="Category" value="Cloudera Manager"/> + </metadata> + </prolog> + + <conbody> + + <p> + To upgrade Impala in a Cloudera Managed environment, using packages: + </p> + + <ol> + <li> + Connect to the Cloudera Manager Admin Console. + </li> + + <li> + In the <b>Services</b> tab, click the <b>Impala</b> service. + </li> + + <li> + Click <b>Actions</b> and click <b>Stop</b>. + </li> + + <li> + Use <b>one</b> of the following sets of commands to update Impala on each Impala node in your cluster: + <p> + <b>For RHEL, Oracle Linux, or CentOS systems:</b> + </p> +<codeblock rev="1.2">$ sudo yum update impala +$ sudo yum update hadoop-lzo-cdh4 # Optional; if this package is already installed +</codeblock> + <p> + <b>For SUSE systems:</b> + </p> +<codeblock rev="1.2">$ sudo zypper update impala +$ sudo zypper update hadoop-lzo-cdh4 # Optional; if this package is already installed +</codeblock> + <p> + <b>For Debian or Ubuntu systems:</b> + </p> +<codeblock rev="1.2">$ sudo apt-get install impala +$ sudo apt-get install hadoop-lzo-cdh4 # Optional; if this package is already installed +</codeblock> + </li> + + <li> + Use <b>one</b> of the following sets of commands to update Impala shell on each node on which it is + installed: + <p> + <b>For RHEL, Oracle Linux, or CentOS systems:</b> + </p> +<codeblock>$ sudo yum update impala-shell</codeblock> + <p> + <b>For SUSE systems:</b> + </p> +<codeblock>$ sudo zypper update impala-shell</codeblock> + <p> + <b>For Debian or Ubuntu systems:</b> + </p> +<codeblock>$ sudo apt-get install impala-shell</codeblock> + </li> + + <li> + Connect to the Cloudera Manager Admin Console. + </li> + + <li> + In the <b>Services</b> tab, click the Impala service. + </li> + + <li> + Click <b>Actions</b> and click <b>Start</b>. + </li> + </ol> </conbody> </concept> + <concept id="upgrade_noncm"> + + <title>Upgrading Impala without Cloudera Manager</title> + <prolog> + <metadata> + <!-- Fill in relevant metatag(s) when we decide how to mark non-CM topics. --> + </metadata> + </prolog> + + <conbody> + + <p> + To upgrade Impala on a cluster not managed by Cloudera Manager, run these Linux commands on the appropriate + hosts in your cluster: + </p> + + <ol> + <li> + Stop Impala services. + <ol> + <li> + Stop <codeph>impalad</codeph> on each Impala node in your cluster: +<codeblock>$ sudo service impala-server stop</codeblock> + </li> + + <li> + Stop any instances of the state store in your cluster: +<codeblock>$ sudo service impala-state-store stop</codeblock> + </li> + + <li rev="1.2"> + Stop any instances of the catalog service in your cluster: +<codeblock>$ sudo service impala-catalog stop</codeblock> + </li> + </ol> + </li> + + <li> + Check if there are new recommended or required configuration settings to put into place in the + configuration files, typically under <filepath>/etc/impala/conf</filepath>. See + <xref href="impala_config_performance.xml#config_performance"/> for settings related to performance and + scalability. 
+ </li> + + <li> + Use <b>one</b> of the following sets of commands to update Impala on each Impala node in your cluster: + <p> + <b>For RHEL, Oracle Linux, or CentOS systems:</b> + </p> +<codeblock>$ sudo yum update impala-server +$ sudo yum update hadoop-lzo-cdh4 # Optional; if this package is already installed +$ sudo yum update impala-catalog # New in Impala 1.2; do yum install when upgrading from 1.1. +</codeblock> + <p> + <b>For SUSE systems:</b> + </p> +<codeblock>$ sudo zypper update impala-server +$ sudo zypper update hadoop-lzo-cdh4 # Optional; if this package is already installed +$ sudo zypper update impala-catalog # New in Impala 1.2; do zypper install when upgrading from 1.1. +</codeblock> + <p> + <b>For Debian or Ubuntu systems:</b> + </p> +<codeblock>$ sudo apt-get install impala-server +$ sudo apt-get install hadoop-lzo-cdh4 # Optional; if this package is already installed +$ sudo apt-get install impala-catalog # New in Impala 1.2. +</codeblock> + </li> + + <li> + Use <b>one</b> of the following sets of commands to update Impala shell on each node on which it is + installed: + <p> + <b>For RHEL, Oracle Linux, or CentOS systems:</b> + </p> +<codeblock>$ sudo yum update impala-shell</codeblock> + <p> + <b>For SUSE systems:</b> + </p> +<codeblock>$ sudo zypper update impala-shell</codeblock> + <p> + <b>For Debian or Ubuntu systems:</b> + </p> +<codeblock>$ sudo apt-get install impala-shell</codeblock> + </li> + + <li rev="alternatives"> + Depending on which release of Impala you are upgrading from, you might find that the symbolic links + <filepath>/etc/impala/conf</filepath> and <filepath>/usr/lib/impala/sbin</filepath> are missing. If so, + see <xref href="impala_known_issues.xml#known_issues"/> for the procedure to work around this + problem. + </li> + + <li> + Restart Impala services: + <ol> + <li> + Restart the Impala state store service on the desired nodes in your cluster. Expect to see a process + named <codeph>statestored</codeph> if the service started successfully. +<codeblock>$ sudo service impala-state-store start +$ ps ax | grep [s]tatestored + 6819 ? Sl 0:07 /usr/lib/impala/sbin/statestored -log_dir=/var/log/impala -state_store_port=24000 +</codeblock> + <p> + Restart the state store service <i>before</i> the Impala server service to avoid <q>Not + connected</q> errors when you run <codeph>impala-shell</codeph>. + </p> + </li> + + <li rev="1.2"> + Restart the Impala catalog service on whichever host it runs on in your cluster. Expect to see a + process named <codeph>catalogd</codeph> if the service started successfully. +<codeblock>$ sudo service impala-catalog restart +$ ps ax | grep [c]atalogd + 6068 ? Sl 4:06 /usr/lib/impala/sbin/catalogd +</codeblock> + </li> + + <li> + Restart the Impala daemon service on each node in your cluster. Expect to see a process named + <codeph>impalad</codeph> if the service started successfully. +<codeblock>$ sudo service impala-server start +$ ps ax | grep [i]mpalad + 7936 ? Sl 0:12 /usr/lib/impala/sbin/impalad -log_dir=/var/log/impala -state_store_port=24000 -use_statestore +-state_store_host=127.0.0.1 -be_port=22000 +</codeblock> + </li> + </ol> + </li> + </ol> + + <note> + <p> + If the services did not start successfully (even though the <codeph>sudo service</codeph> command might + display <codeph>[OK]</codeph>), check for errors in the Impala log file, typically in + <filepath>/var/log/impala</filepath>. 
+ </p> + </note> + </conbody> + </concept> +</concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_use.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_use.xml b/docs/topics/impala_use.xml index 9e0b654..5ffcdeb 100644 --- a/docs/topics/impala_use.xml +++ b/docs/topics/impala_use.xml @@ -3,12 +3,14 @@ <concept id="use"> <title>USE Statement</title> - <titlealts><navtitle>USE</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>USE</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> <data name="Category" value="SQL"/> <data name="Category" value="Databases"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_v_cpu_cores.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_v_cpu_cores.xml b/docs/topics/impala_v_cpu_cores.xml index 41be3af..8091f3a 100644 --- a/docs/topics/impala_v_cpu_cores.xml +++ b/docs/topics/impala_v_cpu_cores.xml @@ -3,6 +3,7 @@ <concept rev="1.2" id="v_cpu_cores"> <title>V_CPU_CORES Query Option (CDH 5 only)</title> + <titlealts audience="PDF"><navtitle>V_CPU_CORES</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -10,16 +11,19 @@ <data name="Category" value="YARN"/> <data name="Category" value="Llama"/> <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> <conbody> + <note conref="../shared/impala_common.xml#common/llama_query_options_obsolete"/> + <p> <indexterm audience="Cloudera">V_CPU_CORES query option</indexterm> The number of per-host virtual CPU cores to request from YARN. If set, the query option overrides the automatic estimate from Impala. -<!-- This sentence is used in a few places and could be conref'ed. --> Used in conjunction with the Impala resource management feature in Impala 1.2 and higher and CDH 5. </p> @@ -31,7 +35,5 @@ <b>Default:</b> 0 (use automatic estimates) </p> -<!-- Worth adding a couple of related info links here. --> - </conbody> </concept> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_varchar.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_varchar.xml b/docs/topics/impala_varchar.xml index 32db4ae..8b05149 100644 --- a/docs/topics/impala_varchar.xml +++ b/docs/topics/impala_varchar.xml @@ -3,7 +3,7 @@ <concept id="varchar" rev="2.0.0"> <title>VARCHAR Data Type (CDH 5.2 or higher only)</title> - <titlealts><navtitle>VARCHAR (CDH 5.2 or higher only)</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>VARCHAR</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -17,7 +17,7 @@ <conbody> - <p> + <p rev="2.0.0"> <indexterm audience="Cloudera">VARCHAR data type</indexterm> A variable-length character type, truncated during processing if necessary to fit within the specified length. @@ -80,6 +80,9 @@ prefer to use an integer data type with sufficient range (<codeph>INT</codeph>, Impala processes those values during a query. 
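+        As an illustration of how a <codeph>VARCHAR</codeph> column truncates longer values to its
+        declared length (the table and column names here are placeholders, shown only as a sketch):
+<codeblock>-- Placeholder table; VARCHAR(5) keeps at most 5 characters.
+CREATE TABLE varchar_demo (s VARCHAR(5));
+INSERT INTO varchar_demo VALUES (CAST('hello world' AS VARCHAR(5)));
+SELECT s FROM varchar_demo;   -- Returns 'hello'.
+</codeblock>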
</p> + <p><b>Avro considerations:</b></p> + <p conref="../shared/impala_common.xml#common/avro_2gb_strings"/> + <p conref="../shared/impala_common.xml#common/schema_evolution_blurb"/> <p> @@ -98,8 +101,7 @@ prefer to use an integer data type with sufficient range (<codeph>INT</codeph>, <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> <p> - This type is available using Impala 2.0 or higher under CDH 4, or with Impala on CDH 5.2 or higher. There are - no compatibility issues with other components when exchanging data files or running Impala on CDH 4. + This type is available on CDH 5.2 or higher. </p> <p conref="../shared/impala_common.xml#common/internals_min_bytes"/> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_variance.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_variance.xml b/docs/topics/impala_variance.xml index e0c5d02..4ce2eaf 100644 --- a/docs/topics/impala_variance.xml +++ b/docs/topics/impala_variance.xml @@ -3,7 +3,7 @@ <concept rev="1.4" id="variance"> <title>VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP Functions</title> - <titlealts><navtitle>VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>VARIANCE, VARIANCE_SAMP, VARIANCE_POP, VAR_SAMP, VAR_POP</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -11,6 +11,8 @@ <data name="Category" value="Impala Functions"/> <data name="Category" value="Aggregate Functions"/> <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_views.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_views.xml b/docs/topics/impala_views.xml index 78288b3..0b2154c 100644 --- a/docs/topics/impala_views.xml +++ b/docs/topics/impala_views.xml @@ -3,7 +3,7 @@ <concept rev="1.1" id="views"> <title>Overview of Impala Views</title> - <titlealts><navtitle>Views</navtitle></titlealts> + <titlealts audience="PDF"><navtitle>Views</navtitle></titlealts> <prolog> <metadata> <data name="Category" value="Impala"/> @@ -13,6 +13,7 @@ <data name="Category" value="Querying"/> <data name="Category" value="Tables"/> <data name="Category" value="Schemas"/> + <data name="Category" value="Views"/> </metadata> </prolog> @@ -93,9 +94,9 @@ select * from report;</codeblock> <li rev="2.3.0 collevelauth"> Set up fine-grained security where a user can query some columns from a table but not other columns. Because CDH 5.5 / Impala 2.3 and higher support column-level authorization, this technique is no longer - required. <!--If you formerly implemented column-level security through views, see + required. If you formerly implemented column-level security through views, see <xref href="sg_hive_sql.xml#concept_c2q_4qx_p4/col_level_auth_sentry"/> for details about the - column-level authorization feature.--> + column-level authorization feature. <!-- See <xref href="impala_authorization.xml#security_examples/sec_ex_views"/> for details. 
--> </li> </ul> http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3c2c8f12/docs/topics/impala_with.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_with.xml b/docs/topics/impala_with.xml index 8d1001c..acc0f80 100644 --- a/docs/topics/impala_with.xml +++ b/docs/topics/impala_with.xml @@ -8,6 +8,8 @@ <data name="Category" value="Impala"/> <data name="Category" value="SQL"/> <data name="Category" value="Querying"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> </metadata> </prolog>
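+        To illustrate the view-based column-restriction technique mentioned in the impala_views.xml
+        change above (all table, view, and role names below are placeholder examples, not from the
+        original docs):
+<codeblock>-- Expose only non-sensitive columns through a view; names are placeholders.
+CREATE VIEW customers_public AS SELECT c_id, c_name FROM customers;
+-- With Sentry authorization enabled, access can be granted on the view rather than the base table.
+GRANT SELECT ON TABLE customers_public TO ROLE analysts;
+</codeblock>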
