http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_txtfile.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_txtfile.xml b/docs/topics/impala_txtfile.xml
new file mode 100644
index 0000000..fc8f846
--- /dev/null
+++ b/docs/topics/impala_txtfile.xml
@@ -0,0 +1,812 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="txtfile">
+
+ <title>Using Text Data Files with Impala Tables</title>
+ <titlealts audience="PDF"><navtitle>Text Data Files</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="File Formats"/>
+ <data name="Category" value="Tables"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Data Analysts"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">Text support in Impala</indexterm>
+ Impala supports using text files as the storage format for input and
output. Text files are a
+ convenient format to use for interchange with other applications or
scripts that produce or read delimited
+ text files, such as CSV or TSV with commas or tabs for delimiters.
+ </p>
+
+ <p>
+ Text files are also very flexible in their column definitions. For
example, a text file could have more
+ fields than the Impala table, and those extra fields are ignored during
queries; or it could have fewer
+ fields than the Impala table, and those missing fields are treated as
<codeph>NULL</codeph> values in
+ queries. You could have fields that were treated as numbers or
timestamps in a table, then use <codeph>ALTER
+ TABLE ... REPLACE COLUMNS</codeph> to switch them to strings, or the
reverse.
+ </p>
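+
+    <p>
+      For example, the following sketch (table and column names are illustrative) reinterprets a
+      column that was loaded as a string as a number; the underlying text data files are unchanged:
+    </p>
+
+<codeblock>-- Second field initially treated as a string.
+create table flexible_table (id int, val string);
+-- Reinterpret the same text data as numbers in subsequent queries.
+alter table flexible_table replace columns (id int, val int);</codeblock>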
+
+ <table>
+ <title>Text Format Support in Impala</title>
+ <tgroup cols="5">
+ <colspec colname="1" colwidth="10*"/>
+ <colspec colname="2" colwidth="10*"/>
+ <colspec colname="3" colwidth="20*"/>
+ <colspec colname="4" colwidth="30*"/>
+ <colspec colname="5" colwidth="30*"/>
+ <thead>
+ <row>
+ <entry>
+ File Type
+ </entry>
+ <entry>
+ Format
+ </entry>
+ <entry>
+ Compression Codecs
+ </entry>
+ <entry>
+ Impala Can CREATE?
+ </entry>
+ <entry>
+ Impala Can INSERT?
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row conref="impala_file_formats.xml#file_formats/txtfile_support">
+ <entry/>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+
+ <p outputclass="toc inpage"/>
+
+ </conbody>
+
+ <concept id="text_performance">
+
+ <title>Query Performance for Impala Text Tables</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Performance"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ Data stored in text format is relatively bulky, and not as efficient
to query as binary formats such as
+      Parquet. You typically use text tables with Impala if that is the format in which you receive the data and you do
+ not have control over that process, or if you are a relatively new
Hadoop user and not familiar with
+ techniques to generate files in other formats. (Because the default
format for <codeph>CREATE
+ TABLE</codeph> is text, you might create your first Impala tables as
text without giving performance much
+ thought.) Either way, look for opportunities to use more efficient
file formats for the tables used in your
+ most performance-critical queries.
+ </p>
+
+ <p>
+ For frequently queried data, you might load the original text data
files into one Impala table, then use an
+ <codeph>INSERT</codeph> statement to transfer the data to another
table that uses the Parquet file format;
+ the data is converted automatically as it is stored in the destination
table.
+ </p>
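+
+    <p>
+      A minimal sketch of that conversion, using illustrative table names:
+    </p>
+
+<codeblock>create table parquet_version like text_version stored as parquet;
+insert into parquet_version select * from text_version;</codeblock>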
+
+ <p>
+      For more compact data, consider using LZO compression for the text files. LZO is the only splittable
+      compression codec that Impala supports for text data: the <q>splittable</q> nature of LZO data files lets
+      different nodes work on different parts of the same file in parallel. See <xref href="impala_txtfile.xml#lzo"/>
+      for details.
+ </p>
+
+ <p rev="2.0.0">
+ In Impala 2.0 and later, you can also use text data compressed in the
gzip, bzip2, or Snappy formats.
+ Because these compressed formats are not <q>splittable</q> in the way
that LZO is, there is less
+ opportunity for Impala to parallelize queries on them. Therefore, use
these types of compressed data only
+ for convenience if that is the format in which you receive the data.
Prefer to use LZO compression for text
+ data if you have the choice, or convert the data to Parquet using an
<codeph>INSERT ... SELECT</codeph>
+ statement to copy the original data into a Parquet table.
+ </p>
+
+ <note rev="2.2.0">
+ <p>
+ Impala supports bzip files created by the <codeph>bzip2</codeph>
command, but not bzip files with
+ multiple streams created by the <codeph>pbzip2</codeph> command.
Impala decodes only the data from the
+ first part of such files, leading to incomplete results.
+ </p>
+
+ <p>
+ The maximum size that Impala can accommodate for an individual bzip
file is 1 GB (after uncompression).
+ </p>
+ </note>
+
+ <p conref="../shared/impala_common.xml#common/s3_block_splitting"/>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="text_ddl">
+
+ <title>Creating Text Tables</title>
+
+ <conbody>
+
+ <p>
+ <b>To create a table using text data files:</b>
+ </p>
+
+ <p>
+ If the exact format of the text data files (such as the delimiter
character) is not significant, use the
+ <codeph>CREATE TABLE</codeph> statement with no extra clauses at the
end to create a text-format table. For
+ example:
+ </p>
+
+<codeblock>create table my_table(id int, s string, n int, t timestamp, b boolean);
+</codeblock>
+
+ <p>
+ The data files created by any <codeph>INSERT</codeph> statements will
use the Ctrl-A character (hex 01) as
+ a separator between each column value.
+ </p>
+
+ <p>
+ A common use case is to import existing text files into an Impala
table. The syntax is more verbose; the
+ significant part is the <codeph>FIELDS TERMINATED BY</codeph> clause,
which must be preceded by the
+ <codeph>ROW FORMAT DELIMITED</codeph> clause. The statement can end
with a <codeph>STORED AS
+ TEXTFILE</codeph> clause, but that clause is optional because text
format tables are the default. For
+ example:
+ </p>
+
+<codeblock>create table csv(id int, s string, n int, t timestamp, b boolean)
+ row format delimited
+ <ph id="csv">fields terminated by ',';</ph>
+
+create table tsv(id int, s string, n int, t timestamp, b boolean)
+ row format delimited
+ <ph id="tsv">fields terminated by '\t';</ph>
+
+create table pipe_separated(id int, s string, n int, t timestamp, b boolean)
+ row format delimited
+ <ph id="psv">fields terminated by '|'</ph>
+ stored as textfile;
+</codeblock>
+
+ <p>
+ You can create tables with specific separator characters to import
text files in familiar formats such as
+ CSV, TSV, or pipe-separated. You can also use these tables to produce
output data files, by copying data
+ into them through the <codeph>INSERT ... SELECT</codeph> syntax and
then extracting the data files from the
+ Impala data directory.
+ </p>
+
+ <p rev="1.3.1">
+      In Impala 1.3.1 and higher, you can specify a delimiter character <codeph>'\0'</codeph> to
+ use the ASCII 0 (<codeph>nul</codeph>) character for text tables:
+ </p>
+
+<codeblock rev="1.3.1">create table nul_separated(id int, s string, n int, t timestamp, b boolean)
+ row format delimited
+ fields terminated by '\0'
+ stored as textfile;
+</codeblock>
+
+ <note>
+ <p>
+ Do not surround string values with quotation marks in text data
files that you construct. If you need to
+ include the separator character inside a field value, for example to
put a string value with a comma
+ inside a CSV-format data file, specify an escape character on the
<codeph>CREATE TABLE</codeph> statement
+ with the <codeph>ESCAPED BY</codeph> clause, and insert that
character immediately before any separator
+ characters that need escaping.
+ </p>
+ </note>
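+
+    <p>
+      For example, this sketch (table name is illustrative) creates a CSV table whose string fields
+      can contain embedded commas, escaped with a backslash:
+    </p>
+
+<codeblock>create table csv_escaped (id int, s string)
+  row format delimited fields terminated by ','
+  escaped by '\\';</codeblock>
+
+    <p>
+      A data file line such as <codeph>1,one\, two</codeph> then loads as a single
+      <codeph>s</codeph> value containing a comma.
+    </p>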
+
+<!--
+ <p>
+ In the <cmdname>impala-shell</cmdname> interpreter, issue a command
similar to:
+ </p>
+
+<codeblock>create table textfile_table (<varname>column_specs</varname>)
stored as textfile;
+/* If the STORED AS clause is omitted, the default is a TEXTFILE with hex 01
characters as the delimiter. */
+create table default_table (<varname>column_specs</varname>);
+/* Some optional clauses in the CREATE TABLE statement apply only to Text
tables. */
+create table csv_table (<varname>column_specs</varname>) row format delimited
fields terminated by ',';
+create table tsv_table (<varname>column_specs</varname>) row format delimited
fields terminated by '\t';
+create table dos_table (<varname>column_specs</varname>) lines terminated by
'\r';</codeblock>
+-->
+
+ <p>
+ Issue a <codeph>DESCRIBE FORMATTED
<varname>table_name</varname></codeph> statement to see the details of
+ how each table is represented internally in Impala.
+ </p>
+
+ <p
conref="../shared/impala_common.xml#common/complex_types_unsupported_filetype"/>
+
+ </conbody>
+
+ </concept>
+
+ <concept id="text_data_files">
+
+ <title>Data Files for Text Tables</title>
+
+ <conbody>
+
+ <p>
+ When Impala queries a table with data in text format, it consults all
the data files in the data directory
+ for that table, with some exceptions:
+ </p>
+
+ <ul rev="2.2.0">
+ <li>
+ <p>
+ Impala ignores any hidden files, that is, files whose names start
with a dot or an underscore.
+ </p>
+ </li>
+
+ <li>
+ <p
conref="../shared/impala_common.xml#common/ignore_file_extensions"/>
+ </li>
+
+ <li>
+<!-- Copied and slightly adapted text from later on in this same file. Turn
into a conref. -->
+ <p>
+          Impala uses file extensions to recognize compressed text data files. For Impala to recognize
+          such files, they must have the appropriate extension corresponding to the compression codec,
+          either <codeph>.gz</codeph>, <codeph>.bz2</codeph>, or <codeph>.snappy</codeph>. The extensions
+          can be in uppercase or lowercase.
+ </p>
+ </li>
+
+ <li>
+ Otherwise, the file names are not significant. When you put files
into an HDFS directory through ETL
+ jobs, or point Impala to an existing HDFS directory with the
<codeph>CREATE EXTERNAL TABLE</codeph>
+        statement, or move data files under Impala control with the <codeph>LOAD DATA</codeph> statement,
+ Impala preserves the original filenames.
+ </li>
+ </ul>
+
+ <p>
+      Data files produced through Impala <codeph>INSERT</codeph> statements are given unique names to
+ avoid filename conflicts.
+ </p>
+
+ <p>
+ An <codeph>INSERT ... SELECT</codeph> statement produces one data file
from each node that processes the
+ <codeph>SELECT</codeph> part of the statement. An <codeph>INSERT ...
VALUES</codeph> statement produces a
+      separate data file for each statement; because Impala is more efficient at querying a small number of huge
+ files than a large number of tiny files, the <codeph>INSERT ...
VALUES</codeph> syntax is not recommended
+ for loading a substantial volume of data. If you find yourself with a
table that is inefficient due to too
+ many small data files, reorganize the data into a few large files by
doing <codeph>INSERT ...
+ SELECT</codeph> to transfer the data to a new table.
+ </p>
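+
+    <p>
+      For example, a sketch of that compaction technique (table names are illustrative):
+    </p>
+
+<codeblock>-- Rewrite the many small files as a few larger ones.
+create table t1_compact like t1;
+insert into t1_compact select * from t1;</codeblock>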
+
+ <p>
+ <b>Special values within text data files:</b>
+ </p>
+
+ <ul>
+ <li rev="1.4.0">
+ <p>
+ Impala recognizes the literal strings <codeph>inf</codeph> for
infinity and <codeph>nan</codeph> for
+ <q>Not a Number</q>, for <codeph>FLOAT</codeph> and
<codeph>DOUBLE</codeph> columns.
+ </p>
+ </li>
+
+ <li>
+ <p>
+ Impala recognizes the literal string <codeph>\N</codeph> to
represent <codeph>NULL</codeph>. When using
+ Sqoop, specify the options <codeph>--null-non-string</codeph> and
<codeph>--null-string</codeph> to
+ ensure all <codeph>NULL</codeph> values are represented correctly
in the Sqoop output files. By default,
+ Sqoop writes <codeph>NULL</codeph> values using the string
<codeph>null</codeph>, which causes a
+          conversion error when such rows are evaluated by Impala. (A workaround for existing tables and data files
+          is to change the table properties through <codeph>ALTER TABLE <varname>name</varname> SET
+          TBLPROPERTIES("serialization.null.format"="null")</codeph>.) See the example Sqoop command
+          after this list.
+ </p>
+ </li>
+
+ <li>
+ <p conref="../shared/impala_common.xml#common/skip_header_lines"/>
+ </li>
+ </ul>
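+
+    <p>
+      For example, a Sqoop command along these lines (connection string, table, and directory
+      names are placeholders) writes <codeph>\N</codeph> for <codeph>NULL</codeph> values:
+    </p>
+
+<codeblock>$ sqoop import --connect jdbc:mysql://db_host/source_db --table source_table \
+    --null-string '\\N' --null-non-string '\\N' \
+    --target-dir /user/etl/staging/source_table</codeblock>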
+
+ </conbody>
+
+ </concept>
+
+ <concept id="text_etl">
+
+ <title>Loading Data into Impala Text Tables</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="ETL"/>
+ <data name="Category" value="Ingest"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ To load an existing text file into an Impala text table, use the
<codeph>LOAD DATA</codeph> statement and
+ specify the path of the file in HDFS. That file is moved into the
appropriate Impala data directory.
+ </p>
+
+ <p>
+ To load multiple existing text files into an Impala text table, use
the <codeph>LOAD DATA</codeph>
+ statement and specify the HDFS path of the directory containing the
files. All non-hidden files are moved
+ into the appropriate Impala data directory.
+ </p>
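+
+    <p>
+      For example, using the <codeph>csv</codeph> table created earlier (HDFS paths are illustrative):
+    </p>
+
+<codeblock>-- Move a single file into the table's data directory.
+load data inpath '/user/etl/input/data1.csv' into table csv;
+-- Move all non-hidden files from a staging directory.
+load data inpath '/user/etl/staging_dir' into table csv;</codeblock>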
+
+ <p>
+ To convert data to text from any other file format supported by
Impala, use a SQL statement such as:
+ </p>
+
+<codeblock>-- Text table with default delimiter, the hex 01 character.
+CREATE TABLE text_table AS SELECT * FROM other_file_format_table;
+
+-- Text table with user-specified delimiter. Currently, you cannot specify
+-- the delimiter as part of CREATE TABLE LIKE or CREATE TABLE AS SELECT.
+-- But you can change an existing text table to have a different delimiter.
+CREATE TABLE csv LIKE other_file_format_table;
+ALTER TABLE csv SET SERDEPROPERTIES ('serialization.format'=',', 'field.delim'=',');
+INSERT INTO csv SELECT * FROM other_file_format_table;</codeblock>
+
+ <p>
+ This can be a useful technique to see how Impala represents special
values within a text-format data file.
+ Use the <codeph>DESCRIBE FORMATTED</codeph> statement to see the HDFS
directory where the data files are
+ stored, then use Linux commands such as <codeph>hdfs dfs -ls
<varname>hdfs_directory</varname></codeph> and
+ <codeph>hdfs dfs -cat <varname>hdfs_file</varname></codeph> to display
the contents of an Impala-created
+ text file.
+ </p>
+
+ <p>
+ To create a few rows in a text table for test purposes, you can use
the <codeph>INSERT ... VALUES</codeph>
+ syntax:
+ </p>
+
+<codeblock>INSERT INTO <varname>text_table</varname> VALUES ('string_literal',100,hex('hello world'));</codeblock>
+
+ <note>
+ Because Impala and the HDFS infrastructure are optimized for
multi-megabyte files, avoid the <codeph>INSERT
+ ... VALUES</codeph> notation when you are inserting many rows. Each
<codeph>INSERT ... VALUES</codeph>
+ statement produces a new tiny file, leading to fragmentation and
reduced performance. When creating any
+ substantial volume of new data, use one of the bulk loading techniques
such as <codeph>LOAD DATA</codeph>
+ or <codeph>INSERT ... SELECT</codeph>. Or, <xref
href="impala_hbase.xml#impala_hbase">use an HBase
+ table</xref> for single-row <codeph>INSERT</codeph> operations,
because HBase tables are not subject to the
+ same fragmentation issues as tables stored on HDFS.
+ </note>
+
+ <p>
+ When you create a text file for use with an Impala text table, specify
<codeph>\N</codeph> to represent a
+ <codeph>NULL</codeph> value. For the differences between
<codeph>NULL</codeph> and empty strings, see
+ <xref href="impala_literals.xml#null"/>.
+ </p>
+
+ <p>
+ If a text file has fewer fields than the columns in the corresponding
Impala table, all the corresponding
+ columns are set to <codeph>NULL</codeph> when the data in that file is
read by an Impala query.
+ </p>
+
+ <p>
+ If a text file has more fields than the columns in the corresponding
Impala table, the extra fields are
+ ignored when the data in that file is read by an Impala query.
+ </p>
+
+ <p>
+ You can also use manual HDFS operations such as <codeph>hdfs dfs
-put</codeph> or <codeph>hdfs dfs
+ -cp</codeph> to put data files in the data directory for an Impala
table. When you copy or move new data
+ files into the HDFS directory for the Impala table, issue a
<codeph>REFRESH
+ <varname>table_name</varname></codeph> statement in
<cmdname>impala-shell</cmdname> before issuing the next
+ query against that table, to make Impala recognize the newly added
files.
+ </p>
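+
+    <p>
+      For example, with the <codeph>csv</codeph> table from earlier (the warehouse path is illustrative):
+    </p>
+
+<codeblock>$ hdfs dfs -put new_data.csv /user/hive/warehouse/mydb.db/csv/
+$ impala-shell -q 'refresh csv'</codeblock>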
+
+ </conbody>
+
+ </concept>
+
+ <concept id="lzo">
+
+ <title>Using LZO-Compressed Text Files</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="LZO"/>
+ <data name="Category" value="Compression"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">LZO support in Impala</indexterm>
+
+ <indexterm audience="Cloudera">compression</indexterm>
+ Impala supports using text data files that employ LZO compression. <ph
rev="upstream">Cloudera</ph> recommends compressing
+ text data files when practical. Impala queries are usually I/O-bound;
reducing the amount of data read from
+ disk typically speeds up a query, despite the extra CPU work to
uncompress the data in memory.
+ </p>
+
+ <p>
+      LZO-compressed text files are preferable to files compressed by other codecs, because
+      LZO-compressed files are <q>splittable</q>, meaning that different portions of a file can be uncompressed
+      and processed independently by different nodes.
+ </p>
+
+ <p>
+ Impala does not currently support writing LZO-compressed text files.
+ </p>
+
+ <p>
+ Because Impala can query LZO-compressed files but currently cannot
write them, you use Hive to do the
+ initial <codeph>CREATE TABLE</codeph> and load the data, then switch
back to Impala to run queries. For
+ instructions on setting up LZO compression for Hive <codeph>CREATE
TABLE</codeph> and
+ <codeph>INSERT</codeph> statements, see
+ <xref
href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO"
scope="external" format="html">the
+ LZO page on the Hive wiki</xref>. Once you have created an LZO text
table, you can also manually add
+ LZO-compressed text files to it, produced by the
+ <xref href="http://www.lzop.org/" scope="external" format="html">
<cmdname>lzop</cmdname></xref> command
+ or similar method.
+ </p>
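+
+    <p>
+      For example, a local text file can be compressed with <cmdname>lzop</cmdname> before being
+      added to the table's HDFS data directory (file names are illustrative):
+    </p>
+
+<codeblock>$ lzop data1.csv    # produces data1.csv.lzo, keeping the original file
+$ hdfs dfs -put data1.csv.lzo /hdfs_location_of_table/</codeblock>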
+
+ <section id="lzo_setup">
+
+ <title>Preparing to Use LZO-Compressed Text Files</title>
+
+ <p>
+ Before using LZO-compressed tables in Impala, do the following
one-time setup for each machine in the
+        cluster. Install the necessary packages using either the Cloudera public repository, a private
+        repository you establish, or individually downloaded packages. You must do these steps manually,
+        whether or not the cluster is managed by the Cloudera Manager product.
+ </p>
+
+ <ol>
+ <li>
+ <b>Prepare your systems to work with LZO by downloading and
installing the appropriate libraries:</b>
+ <p>
+ <b>On systems managed by Cloudera Manager using parcels:</b>
+ </p>
+
+ <p>
+ See the setup instructions for the LZO parcel in the Cloudera
Manager documentation for
+ <xref
href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_install_gpl_extras.html"
scope="external" format="html">Cloudera
+ Manager 5</xref>.
+ </p>
+
+ <p>
+ <b>On systems managed by Cloudera Manager using packages, or not
managed by Cloudera Manager:</b>
+ </p>
+
+ <p>
+ Download and install the appropriate file to each machine on
which you intend to use LZO with Impala.
+ These files all come from the Cloudera
+ <xref
href="https://archive.cloudera.com/gplextras/redhat/5/x86_64/gplextras/"
scope="external" format="html">GPL
+ extras</xref> download site. Install the:
+ </p>
+ <ul>
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/redhat/5/x86_64/gplextras/cloudera-gplextras4.repo"
scope="external" format="repo">Red
+ Hat 5 repo file</xref> to
<filepath>/etc/yum.repos.d/</filepath>.
+ </li>
+
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/redhat/6/x86_64/gplextras/cloudera-gplextras4.repo"
scope="external" format="repo">Red
+ Hat 6 repo file</xref> to
<filepath>/etc/yum.repos.d/</filepath>.
+ </li>
+
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/sles/11/x86_64/gplextras/cloudera-gplextras4.repo"
scope="external" format="repo">SUSE
+ repo file</xref> to <filepath>/etc/zypp/repos.d/</filepath>.
+ </li>
+
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/ubuntu/lucid/amd64/gplextras/cloudera.list"
scope="external" format="list">Ubuntu
+ 10.04 list file</xref> to
<filepath>/etc/apt/sources.list.d/</filepath>.
+ </li>
+
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/ubuntu/precise/amd64/gplextras/cloudera.list"
scope="external" format="list">Ubuntu
+ 12.04 list file</xref> to
<filepath>/etc/apt/sources.list.d/</filepath>.
+ </li>
+
+ <li>
+ <xref
href="https://archive.cloudera.com/gplextras/debian/squeeze/amd64/gplextras/cloudera.list"
scope="external" format="list">Debian
+ list file</xref> to
<filepath>/etc/apt/sources.list.d/</filepath>.
+ </li>
+ </ul>
+ </li>
+
+ <li>
+ <b>Configure Impala to use LZO:</b>
+ <p>
+ Use <b>one</b> of the following sets of commands to refresh your
package management system's
+ repository information, install the base LZO support for Hadoop,
and install the LZO support for
+ Impala.
+ </p>
+
+ <note rev="1.2.0">
+ <p rev="1.2.0">
+ <ph rev="upstream">The name of the Hadoop LZO package changed
between CDH 4 and CDH 5. In CDH 4, the package name was
+ <codeph>hadoop-lzo-cdh4</codeph>. In CDH 5 and higher, the
package name is <codeph>hadoop-lzo</codeph>.</ph>
+ </p>
+ </note>
+
+ <p>
+ <b>For RHEL/CentOS systems:</b>
+ </p>
+<codeblock>$ sudo yum update
+$ sudo yum install hadoop-lzo
+$ sudo yum install impala-lzo</codeblock>
+ <p>
+ <b>For SUSE systems:</b>
+ </p>
+<codeblock rev="1.2">$ sudo zypper update
+$ sudo zypper install hadoop-lzo
+$ sudo zypper install impala-lzo</codeblock>
+ <p>
+ <b>For Debian/Ubuntu systems:</b>
+ </p>
+<codeblock>$ sudo apt-get update
+$ sudo apt-get install hadoop-lzo
+$ sudo apt-get install impala-lzo</codeblock>
+ <note>
+ <p>
+          The level of the <codeph>impala-lzo</codeph> package is closely tied to the version of Impala
+ you use. Any time you upgrade Impala, re-do the installation
command for
+ <codeph>impala-lzo</codeph> on each applicable machine to make
sure you have the appropriate
+ version of that package.
+ </p>
+ </note>
+ </li>
+
+ <li>
+ For <codeph>core-site.xml</codeph> on the client <b>and</b> server
(that is, in the configuration
+ directories for both Impala and Hadoop), append
<codeph>com.hadoop.compression.lzo.LzopCodec</codeph>
+ to the comma-separated list of codecs. For example:
+<codeblock><property>
+ <name>io.compression.codecs</name>
+  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,
+org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,
+org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec</value>
+</property></codeblock>
+ <note>
+ <p>
+ If this is the first time you have edited the Hadoop
<filepath>core-site.xml</filepath> file, note
+ that the <filepath>/etc/hadoop/conf</filepath> directory is
typically a symbolic link, so the
+ canonical <filepath>core-site.xml</filepath> might reside in a
different directory:
+ </p>
+<codeblock>$ ls -l /etc/hadoop
+total 8
+lrwxrwxrwx. 1 root root 29 Feb 26 2013 conf -> /etc/alternatives/hadoop-conf
+lrwxrwxrwx. 1 root root 10 Feb 26 2013 conf.dist -> conf.empty
+drwxr-xr-x. 2 root root 4096 Feb 26 2013 conf.empty
+drwxr-xr-x. 2 root root 4096 Oct 28 15:46 conf.pseudo</codeblock>
+ <p>
+ If the <codeph>io.compression.codecs</codeph> property is
missing from
+ <filepath>core-site.xml</filepath>, only add
<codeph>com.hadoop.compression.lzo.LzopCodec</codeph>
+ to the new property value, not all the names from the
preceding example.
+ </p>
+ </note>
+ </li>
+
+ <li>
+ <!-- To do:
+ Link to CM or other doc where that procedure is explained.
+ Run through the procedure in CM and cite the relevant safety
valves to put the XML into.
+ -->
+ Restart the MapReduce and Impala services.
+ </li>
+ </ol>
+
+ </section>
+
+ <section id="lzo_create_table">
+
+ <title>Creating LZO Compressed Text Tables</title>
+
+ <p>
+ A table containing LZO-compressed text files must be created in Hive
with the following storage clause:
+ </p>
+
+<codeblock>STORED AS
+ INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
+  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'</codeblock>
+
+<!--
+ <p>
+ In Hive, when writing LZO compressed text tables, you must include the
following specification:
+ </p>
+
+<codeblock>hive> SET hive.exec.compress.output=true;
+hive> SET
mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;</codeblock>
+-->
+
+ <p>
+ Also, certain Hive settings need to be in effect. For example:
+ </p>
+
+<codeblock>hive> SET mapreduce.output.fileoutputformat.compress=true;
+hive> SET hive.exec.compress.output=true;
+hive> SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
+hive> CREATE TABLE lzo_t (s string) STORED AS
+    > INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
+    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
+hive> INSERT INTO TABLE lzo_t SELECT col1 FROM uncompressed_text_table;</codeblock>
+
+ <p>
+ Once you have created LZO-compressed text tables, you can convert
data stored in other tables (regardless
+ of file format) by using the <codeph>INSERT ... SELECT</codeph>
statement in Hive.
+ </p>
+
+ <p>
+ Files in an LZO-compressed table must use the <codeph>.lzo</codeph>
extension. Examine the files in the
+ HDFS data directory after doing the <codeph>INSERT</codeph> in Hive,
to make sure the files have the
+ right extension. If the required settings are not in place, you end
up with regular uncompressed files,
+ and Impala cannot access the table because it finds data files with
the wrong (uncompressed) format.
+ </p>
+
+ <p>
+ After loading data into an LZO-compressed text table, index the
files so that they can be split. You
+ index the files by running a Java class,
+ <codeph>com.hadoop.compression.lzo.DistributedLzoIndexer</codeph>,
through the Linux command line. This
+ Java class is included in the <codeph>hadoop-lzo</codeph> package.
+ </p>
+
+ <p>
+ Run the indexer using a command like the following:
+ </p>
+
+<codeblock>$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
+    com.hadoop.compression.lzo.DistributedLzoIndexer /hdfs_location_of_table/</codeblock>
+
+ <note>
+ If the path of the JAR file in the preceding example is not
recognized, do a <cmdname>find</cmdname>
+ command to locate <filepath>hadoop-lzo-*-gplextras.jar</filepath>
and use that path.
+ </note>
+
+ <p>
+ Indexed files have the same name as the file they index, with the
<codeph>.index</codeph> extension. If
+ the data files are not indexed, Impala queries still work, but the
queries read the data from remote
+ DataNodes, which is very inefficient.
+ </p>
+
+ <!-- To do:
+ Here is the place to put some end-to-end examples once I have it
+ all working. Or at least the final step with Impala queries.
+ Have never actually gotten this part working yet due to mismatches
+ between the levels of Impala and LZO packages.
+ -->
+
+ <p>
+ Once the LZO-compressed tables are created, and data is loaded and
indexed, you can query them through
+ Impala. As always, the first time you start
<cmdname>impala-shell</cmdname> after creating a table in
+ Hive, issue an <codeph>INVALIDATE METADATA</codeph> statement so
that Impala recognizes the new table.
+ (In Impala 1.2 and higher, you only have to run <codeph>INVALIDATE
METADATA</codeph> on one node, rather
+ than on all the Impala nodes.)
+ </p>
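+
+      <p>
+        For example, using the <codeph>lzo_t</codeph> table created above:
+      </p>
+
+<codeblock>invalidate metadata lzo_t;
+select count(*) from lzo_t;</codeblock>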
+
+ </section>
+
+ </conbody>
+
+ </concept>
+
+ <concept rev="2.0.0" id="gzip">
+
+ <title>Using gzip, bzip2, or Snappy-Compressed Text Files</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Snappy"/>
+ <data name="Category" value="Gzip"/>
+ <data name="Category" value="Compression"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">gzip support in Impala</indexterm>
+
+ <indexterm audience="Cloudera">bzip2 support in Impala</indexterm>
+
+ <indexterm audience="Cloudera">Snappy support in Impala</indexterm>
+
+ <indexterm audience="Cloudera">compression</indexterm>
+ In Impala 2.0 and later, Impala supports using text data files that
employ gzip, bzip2, or Snappy
+ compression. These compression types are primarily for convenience
within an existing ETL pipeline rather
+ than maximum performance. Although it requires less I/O to read
compressed text than the equivalent
+ uncompressed text, files compressed by these codecs are not
<q>splittable</q> and therefore cannot take
+ full advantage of the Impala parallel query capability.
+ </p>
+
+ <p>
+ As each bzip2- or Snappy-compressed text file is processed, the node
doing the work reads the entire file
+ into memory and then decompresses it. Therefore, the node must have
enough memory to hold both the
+ compressed and uncompressed data from the text file. The memory
required to hold the uncompressed data is
+ difficult to estimate in advance, potentially causing problems on
systems with low memory limits or with
+ resource management enabled. <ph rev="2.1.0">In Impala 2.1 and higher,
this memory overhead is reduced for
+ gzip-compressed text files. The gzipped data is decompressed as it is
read, rather than all at once.</ph>
+ </p>
+
+<!--
+ <p>
+ Impala can work with LZO-compressed text files but not GZip-compressed
text.
+ LZO-compressed files are <q>splittable</q>, meaning that different
portions of a file
+ can be uncompressed and processed independently by different nodes.
GZip-compressed
+ files are not splittable, making them unsuitable for Impala-style
distributed queries.
+ </p>
+-->
+
+ <p>
+ To create a table to hold gzip, bzip2, or Snappy-compressed text,
create a text table with no special
+ compression options. Specify the delimiter and escape character if
required, using the <codeph>ROW
+ FORMAT</codeph> clause.
+ </p>
+
+ <p>
+ Because Impala can query compressed text files but currently cannot
write them, produce the compressed text
+ files outside Impala and use the <codeph>LOAD DATA</codeph> statement,
manual HDFS commands to move them to
+ the appropriate Impala data directory. (Or, you can use <codeph>CREATE
EXTERNAL TABLE</codeph> and point
+ the <codeph>LOCATION</codeph> attribute at a directory containing
existing compressed text files.)
+ </p>
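+    <p>
+      For example, a sketch of the external-table approach (path and names are illustrative):
+    </p>
+
+<codeblock>create external table csv_external (id int, s string)
+  row format delimited fields terminated by ','
+  location '/user/etl/compressed_csv_dir';</codeblock>
+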
+
+ <p>
+ For Impala to recognize the compressed text files, they must have the
appropriate file extension
+ corresponding to the compression codec, either <codeph>.gz</codeph>,
<codeph>.bz2</codeph>, or
+ <codeph>.snappy</codeph>. The extensions can be in uppercase or
lowercase.
+ </p>
+
+ <p>
+      The following example shows how you can create a regular text table, put different kinds of compressed and
+      uncompressed files into it, and have Impala automatically recognize and decompress each one based on its
+      file extension:
+ </p>
+
+<codeblock>create table csv_compressed (a string, b string, c string)
+ row format delimited fields terminated by ",";
+
+insert into csv_compressed values
+ ('one - uncompressed', 'two - uncompressed', 'three - uncompressed'),
+ ('abc - uncompressed', 'xyz - uncompressed', '123 - uncompressed');
+...make equivalent .gz, .bz2, and .snappy files and load them into same table directory...
+
+select * from csv_compressed;
++--------------------+--------------------+----------------------+
+| a | b | c |
++--------------------+--------------------+----------------------+
+| one - snappy | two - snappy | three - snappy |
+| one - uncompressed | two - uncompressed | three - uncompressed |
+| abc - uncompressed | xyz - uncompressed | 123 - uncompressed |
+| one - bz2 | two - bz2 | three - bz2 |
+| abc - bz2 | xyz - bz2 | 123 - bz2 |
+| one - gzip | two - gzip | three - gzip |
+| abc - gzip | xyz - gzip | 123 - gzip |
++--------------------+--------------------+----------------------+
+
+$ hdfs dfs -ls 'hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/';
+...truncated for readability...
+75  hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed.snappy
+79  hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed_bz2.csv.bz2
+80  hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/csv_compressed_gzip.csv.gz
+116 hdfs://127.0.0.1:8020/user/hive/warehouse/file_formats.db/csv_compressed/dd414df64d67d49b_data.0.
+</codeblock>
+
+ </conbody>
+
+ </concept>
+
+ <concept audience="Cloudera" id="txtfile_data_types">
+
+ <title>Data Type Considerations for Text Tables</title>
+
+ <conbody>
+
+ <p></p>
+
+ </conbody>
+
+ </concept>
+
+</concept>