http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_txtfile.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_txtfile.xml b/docs/topics/impala_txtfile.xml
new file mode 100644
index 0000000..fc8f846
--- /dev/null
+++ b/docs/topics/impala_txtfile.xml
@@ -0,0 +1,812 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="txtfile">
+
+ <title>Using Text Data Files with Impala Tables</title>
+ <titlealts audience="PDF"><navtitle>Text Data Files</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="File Formats"/>
+ <data name="Category" value="Tables"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Data Analysts"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">Text support in Impala</indexterm>
+ Impala supports using text files as the storage format for input and
output. Text files are a
+ convenient format to use for interchange with other applications or
scripts that produce or read delimited
+ text files, such as CSV or TSV with commas or tabs for delimiters.
+ </p>
+
+ <p>
+ Text files are also very flexible in their column definitions. For
example, a text file could have more
+ fields than the Impala table, and those extra fields are ignored during
queries; or it could have fewer
+ fields than the Impala table, and those missing fields are treated as
<codeph>NULL</codeph> values in
+ queries. You could have fields that were treated as numbers or
timestamps in a table, then use <codeph>ALTER
+ TABLE ... REPLACE COLUMNS</codeph> to switch them to strings, or the
reverse.
+ </p>
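+
+    <p>
+      For example, the following sketch (table and column names are illustrative) reinterprets a
+      column that was loaded as a string as a number; the underlying text data files are unchanged:
+    </p>
+
+<codeblock>-- Second field initially treated as a string.
+create table flexible_table (id int, val string);
+-- Reinterpret the same text data as numbers in subsequent queries.
+alter table flexible_table replace columns (id int, val int);</codeblock>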
+
+ <table>
+ <title>Text Format Support in Impala</title>
+ <tgroup cols="5">
+ <colspec colname="1" colwidth="10*"/>
+ <colspec colname="2" colwidth="10*"/>
+ <colspec colname="3" colwidth="20*"/>
+ <colspec colname="4" colwidth="30*"/>
+ <colspec colname="5" colwidth="30*"/>
+ <thead>
+ <row>
+ <entry>
+ File Type
+ </entry>
+ <entry>
+ Format
+ </entry>
+ <entry>
+ Compression Codecs
+ </entry>
+ <entry>
+ Impala Can CREATE?
+ </entry>
+ <entry>
+ Impala Can INSERT?
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row conref="impala_file_formats.xml#file_formats/txtfile_support">
+ <entry/>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="text_performance">
+
+ <title>Query Performance for Impala Text Tables</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Performance"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ Data stored in text format is relatively bulky, and not as efficient
to query as binary formats such as
+      Parquet. You typically use text tables with Impala if that is the format in which you receive the data and you do
+ not have control over that process, or if you are a relatively new
Hadoop user and not familiar with
+ techniques to generate files in other formats. (Because the default
format for <codeph>CREATE
+ TABLE</codeph> is text, you might create your first Impala tables as
text without giving performance much
+ thought.) Either way, look for opportunities to use more efficient
file formats for the tables used in your
+ most performance-critical queries.
+ </p>
+
+ <p>
+ For frequently queried data, you might load the original text data
files into one Impala table, then use an
+ <codeph>INSERT</codeph> statement to transfer the data to another
table that uses the Parquet file format;
+ the data is converted automatically as it is stored in the destination
table.
+ </p>
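+
+    <p>
+      A minimal sketch of that conversion, using illustrative table names:
+    </p>
+
+<codeblock>create table parquet_version like text_version stored as parquet;
+insert into parquet_version select * from text_version;</codeblock>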
+
+ <p>
+      For more compact data, consider using LZO compression for the text files. LZO is the only splittable
+      compression codec that Impala supports for text data: the <q>splittable</q> nature of LZO data files lets
+      different nodes work on different parts of the same file in parallel. See <xref href="impala_txtfile.xml#lzo"/>
+      for details.
+ </p>
+
+ <p rev="2.0.0">
+ In Impala 2.0 and later, you can also use text data compressed in the
gzip, bzip2, or Snappy formats.
+ Because these compressed formats are not <q>splittable</q> in the way
that LZO is, there is less
+ opportunity for Impala to parallelize queries on them. Therefore, use
these types of compressed data only
+ for convenience if that is the format in which you receive the data.
Prefer to use LZO compression for text
+ data if you have the choice, or convert the data to Parquet using an
<codeph>INSERT ... SELECT</codeph>
+ statement to copy the original data into a Parquet table.
+ </p>
+
+ <note rev="2.2.0">
+ <p>
+ Impala supports bzip files created by the <codeph>bzip2</codeph>
command, but not bzip files with
+ multiple streams created by the <codeph>pbzip2</codeph> command.
Impala decodes only the data from the
+ first part of such files, leading to incomplete results.
+ </p>
+
+ <p>
+ The maximum size that Impala can accommodate for an individual bzip
file is 1 GB (after uncompression).
+ </p>
+ </note>
+
+ <p conref="../shared/impala_common.xml#common/s3_block_splitting"/>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="text_ddl">
+
+ <title>Creating Text Tables</title>
+
+ <conbody>
+
+ <p>
+ <b>To create a table using text data files:</b>
+ </p>
+
+ <p>
+ If the exact format of the text data files (such as the delimiter
character) is not significant, use the
+ <codeph>CREATE TABLE</codeph> statement with no extra clauses at the
end to create a text-format table. For
+ example:
+ </p>
+
+<codeblock>create table my_table(id int, s string, n int, t timestamp, b boolean);
+</codeblock>
+
+ <p>
+ The data files created by any <codeph>INSERT</codeph> statements will
use the Ctrl-A character (hex 01) as
+ a separator between each column value.
+ </p>
+
+ <p>
+ A common use case is to import existing text files into an Impala
table. The syntax is more verbose; the
+ significant part is the <codeph>FIELDS TERMINATED BY</codeph> clause,
which must be preceded by the
+ <codeph>ROW FORMAT DELIMITED</codeph> clause. The statement can end
with a <codeph>STORED AS
+ TEXTFILE</codeph> clause, but that clause is optional because text
format tables are the default. For
+ example:
+ </p>
+
+<codeblock>create table csv(id int, s string, n int, t timestamp, b boolean)
+ row format delimited
+ <ph id="csv">fields terminated by ',';</ph>
+
+create table tsv(id int, s string, n int, t timestamp, b boolean)
+ row format delimited
+ <ph id="tsv">fields terminated by '\t';</ph>
+
+create table pipe_separated(id int, s string, n int, t timestamp, b boolean)
+ row format delimited
+ <ph id="psv">fields terminated by '|'</ph>
+ stored as textfile;
+</codeblock>
+
+ <p>
+ You can create tables with specific separator characters to import
text files in familiar formats such as
+ CSV, TSV, or pipe-separated. You can also use these tables to produce
output data files, by copying data
+ into them through the <codeph>INSERT ... SELECT</codeph> syntax and
then extracting the data files from the
+ Impala data directory.
+ </p>
+
+ <p rev="1.3.1">
+      In Impala 1.3.1 and higher, you can specify a delimiter character <codeph>'\0'</codeph> to
+ use the ASCII 0 (<codeph>nul</codeph>) character for text tables:
+ </p>
+
+<codeblock rev="1.3.1">create table nul_separated(id int, s string, n int, t timestamp, b boolean)
+ row format delimited
+ fields terminated by '\0'
+ stored as textfile;
+</codeblock>
+
+ <note>
+ <p>
+ Do not surround string values with quotation marks in text data
files that you construct. If you need to
+ include the separator character inside a field value, for example to
put a string value with a comma
+ inside a CSV-format data file, specify an escape character on the
<codeph>CREATE TABLE</codeph> statement
+ with the <codeph>ESCAPED BY</codeph> clause, and insert that
character immediately before any separator
+ characters that need escaping.
+ </p>
+ </note>
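+
+    <p>
+      For example, this sketch (table name is illustrative) creates a CSV table whose string fields
+      can contain embedded commas, escaped with a backslash:
+    </p>
+
+<codeblock>create table csv_escaped (id int, s string)
+  row format delimited fields terminated by ','
+  escaped by '\\';</codeblock>
+
+    <p>
+      A data file line such as <codeph>1,one\, two</codeph> then loads as a single
+      <codeph>s</codeph> value containing a comma.
+    </p>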
+
+<!--
+ <p>
+ In the <cmdname>impala-shell</cmdname> interpreter, issue a command
similar to:
+ </p>
+
+<codeblock>create table textfile_table (<varname>column_specs</varname>)
stored as textfile;
+/* If the STORED AS clause is omitted, the default is a TEXTFILE with hex 01
characters as the delimiter. */
+create table default_table (<varname>column_specs</varname>);
+/* Some optional clauses in the CREATE TABLE statement apply only to Text
tables. */
+create table csv_table (<varname>column_specs</varname>) row format delimited
fields terminated by ',';
+create table tsv_table (<varname>column_specs</varname>) row format delimited
fields terminated by '\t';
+create table dos_table (<varname>column_specs</varname>) lines terminated by
'\r';</codeblock>
+-->
+
+ <p>
+ Issue a <codeph>DESCRIBE FORMATTED
<varname>table_name</varname></codeph> statement to see the details of
+ how each table is represented internally in Impala.
+ </p>
+
+ <p
conref="../shared/impala_common.xml#common/complex_types_unsupported_filetype"/>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="text_data_files">
+
+ <title>Data Files for Text Tables</title>
+
+ <conbody>
+
+ <p>
+ When Impala queries a table with data in text format, it consults all
the data files in the data directory
+ for that table, with some exceptions:
+ </p>
+
+ <ul rev="2.2.0">
+ <li>
+ <p>
+ Impala ignores any hidden files, that is, files whose names start
with a dot or an underscore.
+ </p>
+ </li>
+
+ <li>
+ <p
conref="../shared/impala_common.xml#common/ignore_file_extensions"/>
+ </li>
+
+ <li>
+<!-- Copied and slightly adapted text from later on in this same file. Turn
into a conref. -->
+ <p>
+          Impala uses file extensions to recognize compressed text data files. For Impala to recognize
+          such files, they must have the appropriate extension corresponding to the compression codec,
+          either <codeph>.gz</codeph>, <codeph>.bz2</codeph>, or <codeph>.snappy</codeph>. The extensions
+          can be in uppercase or lowercase.
+ </p>
+ </li>
+
+ <li>
+ Otherwise, the file names are not significant. When you put files
into an HDFS directory through ETL
+ jobs, or point Impala to an existing HDFS directory with the
<codeph>CREATE EXTERNAL TABLE</codeph>
+        statement, or move data files under Impala control with the <codeph>LOAD DATA</codeph> statement,
+ Impala preserves the original filenames.
+ </li>
+ </ul>
+
+ <p>
+      Data files produced through Impala <codeph>INSERT</codeph> statements are given unique names to
+ avoid filename conflicts.
+ </p>
+
+ <p>
+ An <codeph>INSERT ... SELECT</codeph> statement produces one data file
from each node that processes the
+ <codeph>SELECT</codeph> part of the statement. An <codeph>INSERT ...
VALUES</codeph> statement produces a
+      separate data file for each statement; because Impala is more efficient at querying a small number of huge
+ files than a large number of tiny files, the <codeph>INSERT ...
VALUES</codeph> syntax is not recommended
+ for loading a substantial volume of data. If you find yourself with a
table that is inefficient due to too
+ many small data files, reorganize the data into a few large files by
doing <codeph>INSERT ...
+ SELECT</codeph> to transfer the data to a new table.
+ </p>
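+
+    <p>
+      For example, a sketch of that compaction technique (table names are illustrative):
+    </p>
+
+<codeblock>-- Rewrite the many small files as a few larger ones.
+create table t1_compact like t1;
+insert into t1_compact select * from t1;</codeblock>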
+
+ <p>
+ <b>Special values within text data files:</b>
+ </p>
+
+ <ul>
+ <li rev="1.4.0">
+ <p>
+ Impala recognizes the literal strings <codeph>inf</codeph> for
infinity and <codeph>nan</codeph> for
+ <q>Not a Number</q>, for <codeph>FLOAT</codeph> and
<codeph>DOUBLE</codeph> columns.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Impala recognizes the literal string <codeph>\N</codeph> to
represent <codeph>NULL</codeph>. When using
+ Sqoop, specify the options <codeph>--null-non-string</codeph> and
<codeph>--null-string</codeph> to
+ ensure all <codeph>NULL</codeph> values are represented correctly
in the Sqoop output files. By default,
+ Sqoop writes <codeph>NULL</codeph> values using the string
<codeph>null</codeph>, which causes a
+          conversion error when such rows are evaluated by Impala. (A workaround for existing tables and data files
+          is to change the table properties through <codeph>ALTER TABLE <varname>name</varname> SET
+          TBLPROPERTIES("serialization.null.format"="null")</codeph>.) See the example Sqoop command
+          after this list.
+ </p>
+ </li>
+
+ <li>
+ <p conref="../shared/impala_common.xml#common/skip_header_lines"/>
+ </li>
+ </ul>
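+
+    <p>
+      For example, a Sqoop command along these lines (connection string, table, and directory
+      names are placeholders) writes <codeph>\N</codeph> for <codeph>NULL</codeph> values:
+    </p>
+
+<codeblock>$ sqoop import --connect jdbc:mysql://db_host/source_db --table source_table \
+    --null-string '\\N' --null-non-string '\\N' \
+    --target-dir /user/etl/staging/source_table</codeblock>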
+
+ </conbody>
+
+ </concept>
+
+ <concept id="text_etl">
+
+ <title>Loading Data into Impala Text Tables</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="ETL"/>
+ <data name="Category" value="Ingest"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ To load an existing text file into an Impala text table, use the
<codeph>LOAD DATA</codeph> statement and
+ specify the path of the file in HDFS. That file is moved into the
appropriate Impala data directory.
+ </p>
+
+ <p>
+ To load multiple existing text files into an Impala text table, use
the <codeph>LOAD DATA</codeph>
+ statement and specify the HDFS path of the directory containing the
files. All non-hidden files are moved
+ into the appropriate Impala data directory.
+ </p>
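+
+    <p>
+      For example, using the <codeph>csv</codeph> table created earlier (HDFS paths are illustrative):
+    </p>
+
+<codeblock>-- Move a single file into the table's data directory.
+load data inpath '/user/etl/input/data1.csv' into table csv;
+-- Move all non-hidden files from a staging directory.
+load data inpath '/user/etl/staging_dir' into table csv;</codeblock>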
+
+ <p>
+ To convert data to text from any other file format supported by
Impala, use a SQL statement such as:
+ </p>
+
+<codeblock>-- Text table with default delimiter, the hex 01 character.
+CREATE TABLE text_table AS SELECT * FROM other_file_format_table;
+
+-- Text table with user-specified delimiter. Currently, you cannot specify
+-- the delimiter as part of CREATE TABLE LIKE or CREATE TABLE AS SELECT.
+-- But you can change an existing text table to have a different delimiter.
+CREATE TABLE csv LIKE other_file_format_table;
+ALTER TABLE csv SET SERDEPROPERTIES ('serialization.format'=',', 'field.delim'=',');
+INSERT INTO csv SELECT * FROM other_file_format_table;</codeblock>
+
+ <p>
+ This can be a useful technique to see how Impala represents special
values within a text-format data file.
+ Use the <codeph>DESCRIBE FORMATTED</codeph> statement to see the HDFS
directory where the data files are
+ stored, then use Linux commands such as <codeph>hdfs dfs -ls
<varname>hdfs_directory</varname></codeph> and
+ <codeph>hdfs dfs -cat <varname>hdfs_file</varname></codeph> to display
the contents of an Impala-created
+ text file.
+ </p>
+
+ <p>
+ To create a few rows in a text table for test purposes, you can use
the <codeph>INSERT ... VALUES</codeph>
+ syntax:
+ </p>
+
+<codeblock>INSERT INTO <varname>text_table</varname> VALUES ('string_literal',100,hex('hello world'));</codeblock>
+
+ <note>
+ Because Impala and the HDFS infrastructure are optimized for
multi-megabyte files, avoid the <codeph>INSERT
+ ... VALUES</codeph> notation when you are inserting many rows. Each
<codeph>INSERT ... VALUES</codeph>
+ statement produces a new tiny file, leading to fragmentation and
reduced performance. When creating any
+ substantial volume of new data, use one of the bulk loading techniques
such as <codeph>LOAD DATA</codeph>
+ or <codeph>INSERT ... SELECT</codeph>. Or, <xref
href="impala_hbase.xml#impala_hbase">use an HBase
+ table</xref> for single-row <codeph>INSERT</codeph> operations,
because HBase tables are not subject to the
+ same fragmentation issues as tables stored on HDFS.
+ </note>
+
+ <p>
+ When you create a text file for use with an Impala text table, specify
<codeph>\N</codeph> to represent a
+ <codeph>NULL</codeph> value. For the differences between
<codeph>NULL</codeph> and empty strings, see
+ <xref href="impala_literals.xml#null"/>.
+ </p>
+
+ <p>
+ If a text file has fewer fields than the columns in the corresponding
Impala table, all the corresponding
+ columns are set to <codeph>NULL</codeph> when the data in that file is
read by an Impala query.
+ </p>
+
+ <p>
+ If a text file has more fields than the columns in the corresponding
Impala table, the extra fields are
+ ignored when the data in that file is read by an Impala query.
+ </p>
+
+ <p>
+ You can also use manual HDFS operations such as <codeph>hdfs dfs
-put</codeph> or <codeph>hdfs dfs
+ -cp</codeph> to put data files in the data directory for an Impala
table. When you copy or move new data
+ files into the HDFS directory for the Impala table, issue a
<codeph>REFRESH
+ <varname>table_name</varname></codeph> statement in
<cmdname>impala-shell</cmdname> before issuing the next
+ query against that table, to make Impala recognize the newly added
files.
+ </p>
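+
+    <p>
+      For example, with the <codeph>csv</codeph> table from earlier (the warehouse path is illustrative):
+    </p>
+
+<codeblock>$ hdfs dfs -put new_data.csv /user/hive/warehouse/mydb.db/csv/
+$ impala-shell -q 'refresh csv'</codeblock>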
+
+ </conbody>
+
+ </concept>
+
+ <concept id="lzo">
+
+ <title>Using LZO-Compressed Text Files</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="LZO"/>
+ <data name="Category" value="Compression"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">LZO support in Impala</indexterm>
+
+ <indexterm audience="Cloudera">compression</indexterm>
+ Impala supports using text data files that employ LZO compression. <ph
rev="upstream">Cloudera</ph> recommends compressing
+ text data files when practical. Impala queries are usually I/O-bound;
reducing the amount of data read from
+ disk typically speeds up a query, despite the extra CPU work to
uncompress the data in memory.
+ </p>
+
+ <p>
+      LZO-compressed text files are preferable to files compressed by other codecs, because
+      LZO-compressed files are <q>splittable</q>, meaning that different portions of a file can be uncompressed
+      and processed independently by different nodes.
+ </p>
+
+ <p>
+ Impala does not currently support writing LZO-compressed text files.
+ </p>
+
+ <p>
+ Because Impala can query LZO-compressed files but currently cannot
write them, you use Hive to do the
+ initial <codeph>CREATE TABLE</codeph> and load the data, then switch
back to Impala to run queries. For
+ instructions on setting up LZO compression for Hive <codeph>CREATE
TABLE</codeph> and
+ <codeph>INSERT</codeph> statements, see
+ <xref
href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO"
scope="external" format="html">the
+ LZO page on the Hive wiki</xref>. Once you have created an LZO text
table, you can also manually add
+ LZO-compressed text files to it, produced by the
+ <xref href="http://www.lzop.org/" scope="external" format="html">
<cmdname>lzop</cmdname></xref> command
+ or similar method.
+ </p>
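+
+    <p>
+      For example, a local text file can be compressed with <cmdname>lzop</cmdname> before being
+      added to the table's HDFS data directory (file names are illustrative):
+    </p>
+
+<codeblock>$ lzop data1.csv    # produces data1.csv.lzo, keeping the original file
+$ hdfs dfs -put data1.csv.lzo /hdfs_location_of_table/</codeblock>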
+
+ <section id="lzo_setup">
+
+ <title>Preparing to Use LZO-Compressed Text Files</title>
+
+ <p>
+ Before using LZO-compressed tables in Impala, do the following
one-time setup for each machine in the
+        cluster. Install the necessary packages using either the Cloudera public repository, a private
+        repository you establish, or individually downloaded packages. You must do these steps manually,
+        whether or not the cluster is managed by the Cloudera Manager product.
+ </p>
+
+ <ol>
+ <li>
+ <b>Prepare your systems to work with LZO by downloading and
installing the appropriate libraries:</b>
+ <p>
+ <b>On systems managed by Cloudera Manager using parcels:</b>
+ </p>
+
+ <p>
+ See the setup instructions for the LZO parcel in the Cloudera
Manager documentation for
+ <xref
href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_gpl_extras.html"
scope="external" format="html">Cloudera
+ Manager 5</xref>.
+ </p>
+
+ <p>
+ <b>On systems managed by Cloudera Manager using packages, or not
managed by Cloudera Manager:</b>
+ </p>
+
+ <p>
+ Download and install the appropriate file to each machine on
which you intend to use LZO with Impala.
+ These files all come from the Cloudera
+ <xref
href="https://archive.cloudera.com/gplextras/redhat/5/x86_64/gplextras/"
scope="external" format="html">GPL
+ extras</xref> download site. Install the:
+ </p>
+ <ul>
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/redhat/5/x86_64/gplextras/cloudera-gplextras4.repo"
scope="external" format="repo">Red
+ Hat 5 repo file</xref> to
<filepath>/etc/yum.repos.d/</filepath>.
+ </li>
+
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/redhat/6/x86_64/gplextras/cloudera-gplextras4.repo"
scope="external" format="repo">Red
+ Hat 6 repo file</xref> to
<filepath>/etc/yum.repos.d/</filepath>.
+ </li>
+
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/sles/11/x86_64/gplextras/cloudera-gplextras4.repo"
scope="external" format="repo">SUSE
+ repo file</xref> to <filepath>/etc/zypp/repos.d/</filepath>.
+ </li>
+
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/ubuntu/lucid/amd64/gplextras/cloudera.list"
scope="external" format="list">Ubuntu
+ 10.04 list file</xref> to
<filepath>/etc/apt/sources.list.d/</filepath>.
+ </li>
+
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/ubuntu/precise/amd64/gplextras/cloudera.list"
scope="external" format="list">Ubuntu
+ 12.04 list file</xref> to
<filepath>/etc/apt/sources.list.d/</filepath>.
+ </li>
+
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/debian/squeeze/amd64/gplextras/cloudera.list"
scope="external" format="list">Debian
+ list file</xref> to
<filepath>/etc/apt/sources.list.d/</filepath>.
+ </li>
+ </ul>
+ </li>
+
+ <li>
+ <b>Configure Impala to use LZO:</b>
+ <p>
+ Use <b>one</b> of the following sets of commands to refresh your
package management system's
+ repository information, install the base LZO support for Hadoop,
and install the LZO support for
+ Impala.
+ </p>
+
+ <note rev="1.2.0">
+ <p rev="1.2.0">
+ <ph rev="upstream">The name of the Hadoop LZO package changed
between CDH 4 and CDH 5. In CDH 4, the package name was
+ <codeph>hadoop-lzo-cdh4</codeph>. In CDH 5 and higher, the
package name is <codeph>hadoop-lzo</codeph>.</ph>
+ </p>
+ </note>
+
+ <p>
+ <b>For RHEL/CentOS systems:</b>
+ </p>
+<codeblock>$ sudo yum update
+$ sudo yum install hadoop-lzo
+$ sudo yum install impala-lzo</codeblock>
+ <p>
+ <b>For SUSE systems:</b>
+ </p>
+<codeblock rev="1.2">$ sudo zypper update
+$ sudo zypper install hadoop-lzo
+$ sudo zypper install impala-lzo</codeblock>
+ <p>
+ <b>For Debian/Ubuntu systems:</b>
+ </p>
+<codeblock>$ sudo apt-get update
+$ sudo apt-get install hadoop-lzo
+$ sudo apt-get install impala-lzo</codeblock>
+ <note>
+ <p>
+          The level of the <codeph>impala-lzo</codeph> package is closely tied to the version of Impala
+ you use. Any time you upgrade Impala, re-do the installation
command for
+ <codeph>impala-lzo</codeph> on each applicable machine to make
sure you have the appropriate
+ version of that package.
+ </p>
+ </note>
+ </li>
+
+ <li>
+ For <codeph>core-site.xml</codeph> on the client <b>and</b> server
(that is, in the configuration
+ directories for both Impala and Hadoop), append
<codeph>com.hadoop.compression.lzo.LzopCodec</codeph>
+ to the comma-separated list of codecs. For example:
+<codeblock><property>
+ <name>io.compression.codecs</name>
+  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,
+org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,
+org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec</value>
+</property></codeblock>
+ <note>
+ <p>
+ If this is the first time you have edited the Hadoop
<filepath>core-site.xml</filepath> file, note
+ that the <filepath>/etc/hadoop/conf</filepath> directory is
typically a symbolic link, so the
+ canonical <filepath>core-site.xml</filepath> might reside in a
different directory:
+ </p>
+<codeblock>$ ls -l /etc/hadoop
+total 8
+lrwxrwxrwx. 1 root root 29 Feb 26 2013 conf -> /etc/alternatives/hadoop-conf
+lrwxrwxrwx. 1 root root 10 Feb 26 2013 conf.dist -> conf.empty
+drwxr-xr-x. 2 root root 4096 Feb 26 2013 conf.empty
+drwxr-xr-x. 2 root root 4096 Oct 28 15:46 conf.pseudo</codeblock>
+ <p>
+ If the <codeph>io.compression.codecs</codeph> property is
missing from
+ <filepath>core-site.xml</filepath>, only add
<codeph>com.hadoop.compression.lzo.LzopCodec</codeph>
+ to the new property value, not all the names from the
preceding example.
+ </p>
+ </note>
+ </li>
+
+ <li>
+ <!-- To do:
+ Link to CM or other doc where that procedure is explained.
+ Run through the procedure in CM and cite the relevant safety
valves to put the XML into.
+ -->
+ Restart the MapReduce and Impala services.
+ </li>
+ </ol>
+
+ </section>
+
+ <section id="lzo_create_table">
+
+ <title>Creating LZO Compressed Text Tables</title>
+
+ <p>
+ A table containing LZO-compressed text files must be created in Hive
with the following storage clause:
+ </p>
+
+<codeblock>STORED AS
+ INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
+  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'</codeblock>
+
+<!--
+ <p>
+ In Hive, when writing LZO compressed text tables, you must include the
following specification:
+ </p>
+
+<codeblock>hive> SET hive.exec.compress.output=true;
+hive> SET
mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;</codeblock>
+-->
+
+ <p>
+ Also, certain Hive settings need to be in effect. For example:
+ </p>
+
+<codeblock>hive> SET mapreduce.output.fileoutputformat.compress=true;
+hive> SET hive.exec.compress.output=true;
+hive> SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
+hive> CREATE TABLE lzo_t (s string) STORED AS
+    > INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
+    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
+hive> INSERT INTO TABLE lzo_t SELECT col1 FROM uncompressed_text_table;</codeblock>
+
+ <p>
+ Once you have created LZO-compressed text tables, you can convert
data stored in other tables (regardless
+ of file format) by using the <codeph>INSERT ... SELECT</codeph>
statement in Hive.
+ </p>
+
+ <p>
+ Files in an LZO-compressed table must use the <codeph>.lzo</codeph>
extension. Examine the files in the
+ HDFS data directory after doing the <codeph>INSERT</codeph> in Hive,
to make sure the files have the
+ right extension. If the required settings are not in place, you end
up with regular uncompressed files,
+ and Impala cannot access the table because it finds data files with
the wrong (uncompressed) format.
+ </p>
+
+ <p>
+ After loading data into an LZO-compressed text table, index the
files so that they can be split. You
+ index the files by running a Java class,
+ <codeph>com.hadoop.compression.lzo.DistributedLzoIndexer</codeph>,
through the Linux command line. This
+ Java class is included in the <codeph>hadoop-lzo</codeph> package.
+ </p>
+
+ <p>
+ Run the indexer using a command like the following:
+ </p>
+
+<codeblock>$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
+    com.hadoop.compression.lzo.DistributedLzoIndexer /hdfs_location_of_table/</codeblock>
+
+ <note>
+ If the path of the JAR file in the preceding example is not
recognized, do a <cmdname>find</cmdname>
+ command to locate <filepath>hadoop-lzo-*-gplextras.jar</filepath>
and use that path.
+ </note>
+
+ <p>
+ Indexed files have the same name as the file they index, with the
<codeph>.index</codeph> extension. If
+ the data files are not indexed, Impala queries still work, but the
queries read the data from remote
+ DataNodes, which is very inefficient.
+ </p>
+
+ <!-- To do:
+ Here is the place to put some end-to-end examples once I have it
+ all working. Or at least the final step with Impala queries.
+ Have never actually gotten this part working yet due to mismatches
+ between the levels of Impala and LZO packages.
+ -->
+
+ <p>
+ Once the LZO-compressed tables are created, and data is loaded and
indexed, you can query them through
+ Impala. As always, the first time you start
<cmdname>impala-shell</cmdname> after creating a table in
+ Hive, issue an <codeph>INVALIDATE METADATA</codeph> statement so
that Impala recognizes the new table.
+ (In Impala 1.2 and higher, you only have to run <codeph>INVALIDATE
METADATA</codeph> on one node, rather
+ than on all the Impala nodes.)
+ </p>
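+
+      <p>
+        For example, using the <codeph>lzo_t</codeph> table created above:
+      </p>
+
+<codeblock>invalidate metadata lzo_t;
+select count(*) from lzo_t;</codeblock>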
+
+ </section>
+
+ </conbody>
+
+ </concept>
+
+ <concept rev="2.0.0" id="gzip">
+
+ <title>Using gzip, bzip2, or Snappy-Compressed Text Files</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Snappy"/>
+ <data name="Category" value="Gzip"/>
+ <data name="Category" value="Compression"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">gzip support in Impala</indexterm>
+
+ <indexterm audience="Cloudera">bzip2 support in Impala</indexterm>
+
+ <indexterm audience="Cloudera">Snappy support in Impala</indexterm>
+
+ <indexterm audience="Cloudera">compression</indexterm>
+ In Impala 2.0 and later, Impala supports using text data files that
employ gzip, bzip2, or Snappy
+ compression. These compression types are primarily for convenience
within an existing ETL pipeline rather
+ than maximum performance. Although it requires less I/O to read
compressed text than the equivalent
+ uncompressed text, files compressed by these codecs are not
<q>splittable</q> and therefore cannot take
+ full advantage of the Impala parallel query capability.
+ </p>
+
+ <p>
+ As each bzip2- or Snappy-compressed text file is processed, the node
doing the work reads the entire file
+ into memory and then decompresses it. Therefore, the node must have
enough memory to hold both the
+ compressed and uncompressed data from the text file. The memory
required to hold the uncompressed data is
+ difficult to estimate in advance, potentially causing problems on
systems with low memory limits or with
+ resource management enabled. <ph rev="2.1.0">In Impala 2.1 and higher,
this memory overhead is reduced for
+ gzip-compressed text files. The gzipped data is decompressed as it is
read, rather than all at once.</ph>
+ </p>
+
+<!--
+ <p>
+ Impala can work with LZO-compressed text files but not GZip-compressed
text.
+ LZO-compressed files are <q>splittable</q>, meaning that different
portions of a file
+ can be uncompressed and processed independently by different nodes.
GZip-compressed
+ files are not splittable, making them unsuitable for Impala-style
distributed queries.
+ </p>
+-->
+
+ <p>
+ To create a table to hold gzip, bzip2, or Snappy-compressed text,
create a text table with no special
+ compression options. Specify the delimiter and escape character if
required, using the <codeph>ROW
+ FORMAT</codeph> clause.
+ </p>
+
+ <p>
+ Because Impala can query compressed text files but currently cannot
write them, produce the compressed text
+ files outside Impala and use the <codeph>LOAD DATA</codeph> statement,
manual HDFS commands to move them to
+ the appropriate Impala data directory. (Or, you can use <codeph>CREATE
EXTERNAL TABLE</codeph> and point
+ the <codeph>LOCATION</codeph> attribute at a directory containing
existing compressed text files.)
+ </p>
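+    <p>
+      For example, a sketch of the external-table approach (path and names are illustrative):
+    </p>
+
+<codeblock>create external table csv_external (id int, s string)
+  row format delimited fields terminated by ','
+  location '/user/etl/compressed_csv_dir';</codeblock>
+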
+
+ <p>
+ For Impala to recognize the compressed text files, they must have the
appropriate file extension
+ corresponding to the compression codec, either <codeph>.gz</codeph>,
<codeph>.bz2</codeph>, or
+ <codeph>.snappy</codeph>. The extensions can be in uppercase or
lowercase.
+ </p>
+
+ <p>
+      The following example shows how you can create a regular text table, put different kinds of compressed and
+      uncompressed files into it, and have Impala automatically recognize and decompress each one based on its
+      file extension:
+ </p>
+
+<codeblock>create table csv_compressed (a string, b string, c string)
+ row format delimited fields terminated by ",";
+
+insert into csv_compressed values
+ ('one - uncompressed', 'two - uncompressed', 'three - uncompressed'),
+ ('abc - uncompressed', 'xyz - uncompressed', '123 - uncompressed');
+...make equivalent .gz, .bz2, and .snappy files and load them into same table directory...
+
+select * from csv_compressed;
++--------------------+--------------------+----------------------+
+| a | b | c |
++--------------------+--------------------+----------------------+
+| one - snappy | two - snappy | three - snappy |
+| one - uncompressed | two - uncompressed | three - uncompressed |
+| abc - uncompressed | xyz - uncompressed | 123 - uncompressed |
+| one - bz2 | two - bz2 | three - bz2 |
+| abc - bz2 | xyz - bz2 | 123 - bz2 |
+| one - gzip | two - gzip | three - gzip |
+| abc - gzip | xyz - gzip | 123 - gzip |
++--------------------+--------------------+----------------------+
+
+$ hdfs dfs -ls 'hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/';
+...truncated for readability...
+75  hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed.snappy
+79  hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed_bz2.csv.bz2
+80  hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed_gzip.csv.gz
+116 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/dd414df64d67d49b_data.0.
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ <concept audience="Cloudera" id="txtfile_data_types">
+
+ <title>Data Type Considerations for Text Tables</title>
+
+ <conbody>
+
+ <p></p>
+
+ </conbody>
+
+ </concept>
+
+</concept>