http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_file_formats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_file_formats.xml b/docs/topics/impala_file_formats.xml
new file mode 100644
index 0000000..48b9e7c
--- /dev/null
+++ b/docs/topics/impala_file_formats.xml
@@ -0,0 +1,270 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="file_formats">
+
+ <title>How Impala Works with Hadoop File Formats</title>
+ <titlealts audience="PDF"><navtitle>File Formats</navtitle></titlealts>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Impala"/>
+ <data name="Category" value="Concepts"/>
+ <data name="Category" value="Hadoop"/>
+ <data name="Category" value="File Formats"/>
+ <data name="Category" value="Developers"/>
+ <data name="Category" value="Data Analysts"/>
+ <!-- Like Impala Administration, this page has a fair bit of info already, but it could benefit from wiki-style embedding of intro text from those other pages. -->
+ <!-- In this case, that would also enable a good in-page TOC since there is already one lonely subtopic on this same page. -->
+ <data name="Category" value="Stub Pages"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ <indexterm audience="Cloudera">file formats</indexterm>
+ <indexterm audience="Cloudera">compression</indexterm>
+ Impala supports several familiar file formats used in Apache Hadoop. Impala can load and query data files
+ produced by other Hadoop components such as Pig or MapReduce, and data files produced by Impala can be used
+ by other components also. The following sections discuss the procedures, limitations, and performance
+ considerations for using each file format with Impala.
+ </p>
+
+ <p>
+ The file format used for an Impala table has significant performance consequences. Some file formats include
+ compression support that affects the size of data on the disk and, consequently, the amount of I/O and CPU
+ resources required to deserialize data. The amounts of I/O and CPU resources required can be a limiting
+ factor in query performance, because querying often begins with moving and decompressing data. Compressing
+ the data reduces the total number of bytes transferred from disk to memory, which shortens the transfer
+ time, but the tradeoff is the CPU time spent decompressing the content.
+ </p>
+
+ <p>
+ Impala can query files encoded with most of the popular file formats and compression codecs used in Hadoop.
+ Impala can create and insert data into tables that use some file formats but not others; for file formats
+ that Impala cannot write to, create the table in Hive, issue the <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph>
+ statement in <codeph>impala-shell</codeph>, and query the table through Impala (see the example following
+ the table). File formats can be structured, in which case they may include metadata and built-in
+ compression. Supported formats include:
+ </p>
+
+ <table>
+ <title>File Format Support in Impala</title>
+ <tgroup cols="5">
+ <colspec colname="1" colwidth="10*"/>
+ <colspec colname="2" colwidth="10*"/>
+ <colspec colname="3" colwidth="20*"/>
+ <colspec colname="4" colwidth="30*"/>
+ <colspec colname="5" colwidth="30*"/>
+ <thead>
+ <row>
+ <entry>
+ File Type
+ </entry>
+ <entry>
+ Format
+ </entry>
+ <entry>
+ Compression Codecs
+ </entry>
+ <entry>
+ Impala Can CREATE?
+ </entry>
+ <entry>
+ Impala Can INSERT?
+ </entry>
+ </row>
+ </thead>
+ <tbody>
+ <row id="parquet_support">
+ <entry>
+ <xref href="impala_parquet.xml#parquet">Parquet</xref>
+ </entry>
+ <entry>
+ Structured
+ </entry>
+ <entry>
+ Snappy, gzip; currently Snappy by default
+ </entry>
+ <entry>
+ Yes.
+ </entry>
+ <entry>
+ Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
+ </entry>
+ </row>
+ <row id="txtfile_support">
+ <entry>
+ <xref href="impala_txtfile.xml#txtfile">Text</xref>
+ </entry>
+ <entry>
+ Unstructured
+ </entry>
+ <entry rev="2.0.0">
+ LZO, gzip, bzip2, Snappy
+ </entry>
+ <entry>
+ Yes. For <codeph>CREATE TABLE</codeph> with no <codeph>STORED AS</codeph> clause, the default file
+ format is uncompressed text, with values separated by ASCII <codeph>0x01</codeph> characters
+ (typically represented as Ctrl-A).
+ </entry>
+ <entry>
+ Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
+ If LZO compression is used, you must create the table and load data in Hive. If other kinds of
+ compression are used, you must load data through <codeph>LOAD DATA</codeph>, Hive, or manually in
+ HDFS.
+
+<!-- <ph rev="2.0.0">Impala 2.0 and higher can write LZO-compressed text data; for earlier Impala releases, you must create the table and load data in Hive.</ph> -->
+ </entry>
+ </row>
+ <row id="avro_support">
+ <entry>
+ <xref href="impala_avro.xml#avro">Avro</xref>
+ </entry>
+ <entry>
+ Structured
+ </entry>
+ <entry>
+ Snappy, gzip, deflate, bzip2
+ </entry>
+ <entry rev="1.4.0">
+ Yes, in Impala 1.4.0 and higher. Before that, create the table using Hive.
+ </entry>
+ <entry>
+ No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+ <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+ </entry>
+<!-- <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry> -->
+ </row>
+ <row id="rcfile_support">
+ <entry>
+ <xref href="impala_rcfile.xml#rcfile">RCFile</xref>
+ </entry>
+ <entry>
+ Structured
+ </entry>
+ <entry>
+ Snappy, gzip, deflate, bzip2
+ </entry>
+ <entry>
+ Yes.
+ </entry>
+ <entry>
+ No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+ <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+ </entry>
+<!--
+ <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry>
+ -->
+ </row>
+ <row id="sequencefile_support">
+ <entry>
+ <xref href="impala_seqfile.xml#seqfile">SequenceFile</xref>
+ </entry>
+ <entry>
+ Structured
+ </entry>
+ <entry>
+ Snappy, gzip, deflate, bzip2
+ </entry>
+ <entry>Yes.</entry>
+ <entry>
+ No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+ <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+ </entry>
+<!--
+ <entry rev="2.0.0">
+ Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD
+ DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.
+ </entry>
+-->
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
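+
+ <p>
+ For example, here is one possible workflow, with hypothetical table and column names, for a file format
+ that Impala can query but not write to, assuming a Hive version that supports the
+ <codeph>STORED AS AVRO</codeph> shorthand: create and populate the table in Hive, then make it
+ visible and current in Impala.
+ </p>
+
+<codeblock>-- In Hive: create and populate a table in a format Impala cannot write to.
+CREATE TABLE avro_events (event_id INT, event_name STRING) STORED AS AVRO;
+INSERT INTO TABLE avro_events SELECT event_id, event_name FROM staging_events;
+
+-- In impala-shell: make the new table visible to Impala, then query it.
+INVALIDATE METADATA avro_events;
+SELECT COUNT(*) FROM avro_events;
+
+-- After subsequent Hive inserts into the same table, REFRESH is sufficient.
+REFRESH avro_events;</codeblock>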
+
+ <p rev="DOCS-1370">
+ Impala can only query the file formats listed in the preceding table.
+ In particular, Impala does not support the ORC file format.
+ </p>
+
+ <p>
+ Impala supports the following compression codecs (a sketch of selecting a codec when writing data follows the list):
+ </p>
+
+ <ul>
+ <li rev="2.0.0">
+ Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy
+ compression is very fast, but gzip provides greater space savings. Supported for text files in Impala 2.0
+ and higher.
+<!-- Not supported for text files. -->
+ </li>
+
+ <li rev="2.0.0">
+ Gzip. Recommended when you want the highest level of compression (and therefore the greatest disk-space
+ savings). Supported for text files in Impala 2.0 and higher.
+ </li>
+
+ <li>
+ Deflate. Not supported for text files.
+ </li>
+
+ <li rev="2.0.0">
+ Bzip2. Supported for text files in Impala 2.0 and higher.
+<!-- Not supported for text files. -->
+ </li>
+
+ <li>
+ <p rev="2.0.0">LZO, for text files only. Impala can query LZO-compressed text tables, but currently
+ cannot create them or insert data into them; perform these operations in Hive.</p>
+ </li>
+ </ul>
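+
+ <p>
+ As a brief sketch of selecting a codec when writing data, the following example (the table names are
+ hypothetical, and Impala 2.0 or higher is assumed) sets the <codeph>COMPRESSION_CODEC</codeph> query
+ option in <codeph>impala-shell</codeph> before writing Parquet data:
+ </p>
+
+<codeblock>-- Choose gzip instead of the default Snappy for subsequent Parquet writes.
+SET COMPRESSION_CODEC=gzip;
+INSERT OVERWRITE archive_parquet SELECT * FROM daily_text;
+
+-- Restore the default codec afterwards.
+SET COMPRESSION_CODEC=snappy;</codeblock>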
+ </conbody>
+
+ <concept id="file_format_choosing">
+
+ <title>Choosing the File Format for a Table</title>
+ <prolog>
+ <metadata>
+ <data name="Category" value="Planning"/>
+ </metadata>
+ </prolog>
+
+ <conbody>
+
+ <p>
+ Different file formats and compression codecs work better for different data sets. While Impala typically
+ provides performance gains regardless of file format, choosing the proper format for your data can yield
+ further performance improvements. Use the following considerations to decide which combination of file
+ format and compression to use for a particular table:
+ </p>
+
+ <ul>
+ <li>
+ If you are working with existing files that are already in a supported file format, use the same format
+ for the Impala table where practical. If the original format does not yield acceptable query performance
+ or resource usage, consider creating a new Impala table with different file format or compression
+ characteristics, and doing a one-time conversion by copying the data to the new table using the
+ <codeph>INSERT</codeph> statement. Depending on the file format, you might run the
+ <codeph>INSERT</codeph> statement in <codeph>impala-shell</codeph> or in Hive.
+ </li>
+
+ <li>
+ Text files are convenient to produce through many different tools, and are human-readable for ease of
+ verification and debugging. Those characteristics are why text is the default format for an Impala
+ <codeph>CREATE TABLE</codeph> statement. When performance and resource usage are the primary
+ considerations, use one of the other file formats and consider using compression. A typical workflow
+ might involve bringing data into an Impala table by copying CSV or TSV files into the appropriate data
+ directory, and then using the <codeph>INSERT ... SELECT</codeph> syntax to copy the data into a table
+ using a different, more compact file format, as shown in the sketch after this list.
+ </li>
+
+ <li>
+ If your architecture involves storing data to be queried in memory, do not compress the data.
+ Compression yields no I/O savings because the data does not need to be moved from disk, but it still
+ incurs a CPU cost to decompress the data.
+ </li>
+ </ul>
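+
+ <p>
+ The following is a minimal sketch of the conversion workflow described above, assuming hypothetical
+ table names and a comma-delimited data file already in HDFS:
+ </p>
+
+<codeblock>-- Text staging table matching the layout of the CSV data.
+CREATE TABLE events_text (event_id INT, event_name STRING)
+  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
+
+-- Move a file already in HDFS into the table's data directory.
+LOAD DATA INPATH '/staging/events.csv' INTO TABLE events_text;
+
+-- One-time conversion into a more compact, query-efficient format.
+CREATE TABLE events_parquet STORED AS PARQUET AS SELECT * FROM events_text;</codeblock>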
+ </conbody>
+ </concept>
+</concept>