http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_parquet.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_parquet.html b/docs/build/html/topics/impala_parquet.html
new file mode 100644
index 0000000..894c97a
--- /dev/null
+++ b/docs/build/html/topics/impala_parquet.html
@@ -0,0 +1,1392 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_file_formats.html"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using the Parquet File Format with Impala Tables</title></head><body id="parquet"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Using the Parquet File Format with Impala Tables</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Impala helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format
+ intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is
+ especially good for queries scanning particular columns within a table, for example to query <span class="q">"wide"</span>
+ tables with many columns, or to perform aggregation operations such as <code class="ph codeph">SUM()</code> and
+ <code class="ph codeph">AVG()</code> that need to process most or all of the values from a column. Each data file contains
+ the values for a set of rows (the <span class="q">"row group"</span>).
Within a data file, the values from each column are + organized so that they are all adjacent, enabling good compression for the values from that column. Queries + against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O. + </p> + + <table class="table"><caption><span class="table--title-label">Table 1. </span><span class="title">Parquet Format Support in Impala</span></caption><colgroup><col style="width:10%"><col style="width:10%"><col style="width:20%"><col style="width:30%"><col style="width:30%"></colgroup><thead class="thead"> + <tr class="row"> + <th class="entry nocellnorowborder" id="parquet__entry__1"> + File Type + </th> + <th class="entry nocellnorowborder" id="parquet__entry__2"> + Format + </th> + <th class="entry nocellnorowborder" id="parquet__entry__3"> + Compression Codecs + </th> + <th class="entry nocellnorowborder" id="parquet__entry__4"> + Impala Can CREATE? + </th> + <th class="entry nocellnorowborder" id="parquet__entry__5"> + Impala Can INSERT? + </th> + </tr> + </thead><tbody class="tbody"> + <tr class="row"> + <td class="entry nocellnorowborder" headers="parquet__entry__1 "> + <a class="xref" href="impala_parquet.html#parquet">Parquet</a> + </td> + <td class="entry nocellnorowborder" headers="parquet__entry__2 "> + Structured + </td> + <td class="entry nocellnorowborder" headers="parquet__entry__3 "> + Snappy, gzip; currently Snappy by default + </td> + <td class="entry nocellnorowborder" headers="parquet__entry__4 "> + Yes. + </td> + <td class="entry nocellnorowborder" headers="parquet__entry__5 "> + Yes: <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and query. + </td> + </tr> + </tbody></table> + + <p class="p toc inpage"></p> + + </div> + + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_file_formats.html">How Impala Works with Hadoop File Formats</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="parquet__parquet_ddl"> + + <h2 class="title topictitle2" id="ariaid-title2">Creating Parquet Tables in Impala</h2> + + <div class="body conbody"> + + <p class="p"> + To create a table named <code class="ph codeph">PARQUET_TABLE</code> that uses the Parquet format, you would use a + command like the following, substituting your own table name, column names, and data types: + </p> + +<pre class="pre codeblock"><code>[impala-host:21000] > create table <var class="keyword varname">parquet_table_name</var> (x INT, y STRING) STORED AS PARQUET;</code></pre> + + + + <p class="p"> + Or, to clone the column names and data types of an existing table: + </p> + +<pre class="pre codeblock"><code>[impala-host:21000] > create table <var class="keyword varname">parquet_table_name</var> LIKE <var class="keyword varname">other_table_name</var> STORED AS PARQUET;</code></pre> + + <p class="p"> + In Impala 1.4.0 and higher, you can derive column definitions from a raw Parquet data file, even without an + existing Impala table. 
For example, you can create an external table pointing to an HDFS directory, and + base the column definitions on one of the files in that directory: + </p> + +<pre class="pre codeblock"><code>CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat' + STORED AS PARQUET + LOCATION '/user/etl/destination'; +</code></pre> + + <p class="p"> + Or, you can refer to an existing data file and create a new empty table with suitable column definitions. + Then you can use <code class="ph codeph">INSERT</code> to create new data files or <code class="ph codeph">LOAD DATA</code> to transfer + existing data files into the new table. + </p> + +<pre class="pre codeblock"><code>CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat' + STORED AS PARQUET; +</code></pre> + + <p class="p"> + The default properties of the newly created table are the same as for any other <code class="ph codeph">CREATE + TABLE</code> statement. For example, the default file format is text; if you want the new table to use + the Parquet file format, include the <code class="ph codeph">STORED AS PARQUET</code> file also. + </p> + + <p class="p"> + In this example, the new table is partitioned by year, month, and day. These partition key columns are not + part of the data file, so you specify them in the <code class="ph codeph">CREATE TABLE</code> statement: + </p> + +<pre class="pre codeblock"><code>CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat' + PARTITION (year INT, month TINYINT, day TINYINT) + STORED AS PARQUET; +</code></pre> + + <p class="p"> + See <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> for more details about the <code class="ph codeph">CREATE TABLE + LIKE PARQUET</code> syntax. + </p> + + <p class="p"> + Once you have created a table, to insert data into that table, use a command similar to the following, + again with your own table names: + </p> + + + +<pre class="pre codeblock"><code>[impala-host:21000] > insert overwrite table <var class="keyword varname">parquet_table_name</var> select * from <var class="keyword varname">other_table_name</var>;</code></pre> + + <p class="p"> + If the Parquet table has a different number of columns or different column names than the other table, + specify the names of columns from the other table rather than <code class="ph codeph">*</code> in the + <code class="ph codeph">SELECT</code> statement. + </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="parquet__parquet_etl"> + + <h2 class="title topictitle2" id="ariaid-title3">Loading Data into Parquet Tables</h2> + + + <div class="body conbody"> + + <p class="p"> + Choose from the following techniques for loading data into Parquet tables, depending on whether the + original data is already in an Impala table, or exists as raw data files outside Impala. + </p> + + <p class="p"> + If you already have data in an Impala or Hive table, perhaps in a different file format or partitioning + scheme, you can transfer the data to a Parquet table using the Impala <code class="ph codeph">INSERT...SELECT</code> + syntax. You can convert, filter, repartition, and do other things to the data as part of this same + <code class="ph codeph">INSERT</code> statement. See <a class="xref" href="#parquet_compression">Snappy and GZip Compression for Parquet Data Files</a> for some examples showing how to + insert data into Parquet tables. 
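+ </p>
+
+ <p class="p">
+ For instance, a conversion might follow the pattern sketched below. The table and column names
+ (<code class="ph codeph">text_logs</code> and <code class="ph codeph">logs_parquet</code>) are hypothetical placeholders,
+ not tables used elsewhere in this topic:
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical example: TEXT_LOGS is an existing table in text format.
+-- Create a Parquet table with the desired layout.
+CREATE TABLE logs_parquet (url STRING, http_code SMALLINT, response_time INT)
+  STORED AS PARQUET;
+
+-- Convert and filter the data in a single INSERT ... SELECT statement.
+INSERT INTO logs_parquet
+  SELECT url, http_code, response_time
+  FROM text_logs
+  WHERE response_time IS NOT NULL;
+</code></pre>
+
+ <p class="p">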
+ </p> + + <div class="p"> + When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in + the <code class="ph codeph">INSERT</code> statement to fine-tune the overall performance of the operation and its + resource usage: + <ul class="ul"> + <li class="li"> + These hints are available in Impala 1.2.2 and higher. + </li> + + <li class="li"> + You would only use these hints if an <code class="ph codeph">INSERT</code> into a partitioned Parquet table was + failing due to capacity limits, or if such an <code class="ph codeph">INSERT</code> was succeeding but with + less-than-optimal performance. + </li> + + <li class="li"> + To use these hints, put the hint keyword <code class="ph codeph">[SHUFFLE]</code> or <code class="ph codeph">[NOSHUFFLE]</code> + (including the square brackets) after the <code class="ph codeph">PARTITION</code> clause, immediately before the + <code class="ph codeph">SELECT</code> keyword. + </li> + + <li class="li"> + <code class="ph codeph">[SHUFFLE]</code> selects an execution plan that minimizes the number of files being written + simultaneously to HDFS, and the number of memory buffers holding data for individual partitions. Thus + it reduces overall resource usage for the <code class="ph codeph">INSERT</code> operation, allowing some + <code class="ph codeph">INSERT</code> operations to succeed that otherwise would fail. It does involve some data + transfer between the nodes so that the data files for a particular partition are all constructed on the + same node. + </li> + + <li class="li"> + <code class="ph codeph">[NOSHUFFLE]</code> selects an execution plan that might be faster overall, but might also + produce a larger number of small data files or exceed capacity limits, causing the + <code class="ph codeph">INSERT</code> operation to fail. Use <code class="ph codeph">[SHUFFLE]</code> in cases where an + <code class="ph codeph">INSERT</code> statement fails or runs inefficiently due to all nodes attempting to construct + data for all partitions. + </li> + + <li class="li"> + Impala automatically uses the <code class="ph codeph">[SHUFFLE]</code> method if any partition key column in the + source table, mentioned in the <code class="ph codeph">INSERT ... SELECT</code> query, does not have column + statistics. In this case, only the <code class="ph codeph">[NOSHUFFLE]</code> hint would have any effect. + </li> + + <li class="li"> + If column statistics are available for all partition key columns in the source table mentioned in the + <code class="ph codeph">INSERT ... SELECT</code> query, Impala chooses whether to use the <code class="ph codeph">[SHUFFLE]</code> + or <code class="ph codeph">[NOSHUFFLE]</code> technique based on the estimated number of distinct values in those + columns and the number of nodes involved in the <code class="ph codeph">INSERT</code> operation. In this case, you + might need the <code class="ph codeph">[SHUFFLE]</code> or the <code class="ph codeph">[NOSHUFFLE]</code> hint to override the + execution plan selected by Impala. + </li> + </ul> + </div> + + <p class="p"> + Any <code class="ph codeph">INSERT</code> statement for a Parquet table requires enough free space in the HDFS filesystem + to write one block. Because Parquet data files use a block size of 1 GB by default, an + <code class="ph codeph">INSERT</code> might fail (even for a very small amount of data) if your HDFS is running low on + space. 
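+ </p>
+
+ <p class="p">
+ As an illustration of where the hint keyword goes, the following hypothetical statement
+ (the table names are placeholders) uses the <code class="ph codeph">[SHUFFLE]</code> hint described above,
+ placed after the <code class="ph codeph">PARTITION</code> clause and immediately before the
+ <code class="ph codeph">SELECT</code> keyword:
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical tables; SALES_PARQUET is partitioned by YEAR and MONTH.
+INSERT INTO sales_parquet PARTITION (year, month) [SHUFFLE]
+  SELECT amount, customer_id, year, month
+  FROM sales_staging;
+</code></pre>
+
+ <p class="p">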
+ </p> + + + + <p class="p"> + Avoid the <code class="ph codeph">INSERT...VALUES</code> syntax for Parquet tables, because + <code class="ph codeph">INSERT...VALUES</code> produces a separate tiny data file for each + <code class="ph codeph">INSERT...VALUES</code> statement, and the strength of Parquet is in its handling of data + (compressing, parallelizing, and so on) in <span class="ph">large</span> chunks. + </p> + + <p class="p"> + If you have one or more Parquet data files produced outside of Impala, you can quickly make the data + queryable through Impala by one of the following methods: + </p> + + <ul class="ul"> + <li class="li"> + The <code class="ph codeph">LOAD DATA</code> statement moves a single data file or a directory full of data files into + the data directory for an Impala table. It does no validation or conversion of the data. The original + data files must be somewhere in HDFS, not the local filesystem. + + </li> + + <li class="li"> + The <code class="ph codeph">CREATE TABLE</code> statement with the <code class="ph codeph">LOCATION</code> clause creates a table + where the data continues to reside outside the Impala data directory. The original data files must be + somewhere in HDFS, not the local filesystem. For extra safety, if the data is intended to be long-lived + and reused by other applications, you can use the <code class="ph codeph">CREATE EXTERNAL TABLE</code> syntax so that + the data files are not deleted by an Impala <code class="ph codeph">DROP TABLE</code> statement. + + </li> + + <li class="li"> + If the Parquet table already exists, you can copy Parquet data files directly into it, then use the + <code class="ph codeph">REFRESH</code> statement to make Impala recognize the newly added data. Remember to preserve + the block size of the Parquet data files by using the <code class="ph codeph">hadoop distcp -pb</code> command rather + than a <code class="ph codeph">-put</code> or <code class="ph codeph">-cp</code> operation on the Parquet files. See + <a class="xref" href="#parquet_compression_multiple">Example of Copying Parquet Data Files</a> for an example of this kind of operation. + </li> + </ul> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + Currently, Impala always decodes the column data in Parquet files based on the ordinal position of the + columns, not by looking up the position of each column based on its name. Parquet files produced outside + of Impala must write column data in the same order as the columns are declared in the Impala table. Any + optional columns that are omitted from the data files must be the rightmost columns in the Impala table + definition. + </p> + + <p class="p"> + If you created compressed Parquet files through some tool other than Impala, make sure that any + compression codecs are supported in Parquet by Impala. For example, Impala does not currently support LZO + compression in Parquet files. Also doublecheck that you used any recommended compatibility settings in + the other tool, such as <code class="ph codeph">spark.sql.parquet.binaryAsString</code> when writing Parquet files + through Spark. + </p> + </div> + + <p class="p"> + Recent versions of Sqoop can produce Parquet output files using the <code class="ph codeph">--as-parquetfile</code> + option. 
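+ </p>
+
+ <p class="p">
+ For example, once Sqoop (or any other tool) has written Parquet files into an HDFS staging
+ directory, a hypothetical sequence using the <code class="ph codeph">LOAD DATA</code> and
+ <code class="ph codeph">REFRESH</code> techniques listed above might look like the following. The path and
+ table name are placeholders:
+ </p>
+
+<pre class="pre codeblock"><code>-- LOAD DATA moves the files as-is, without validating or converting them.
+LOAD DATA INPATH '/user/etl/staging/orders_parquet'
+  INTO TABLE orders_parquet;
+
+-- If the files were instead copied straight into the table directory
+-- (for example with 'hadoop distcp -pb'), make Impala aware of them:
+REFRESH orders_parquet;
+</code></pre>
+
+ <p class="p">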
+ </p> + + <p class="p"> If you use Sqoop to + convert RDBMS data to Parquet, be careful with interpreting any + resulting values from <code class="ph codeph">DATE</code>, <code class="ph codeph">DATETIME</code>, + or <code class="ph codeph">TIMESTAMP</code> columns. The underlying values are + represented as the Parquet <code class="ph codeph">INT64</code> type, which is + represented as <code class="ph codeph">BIGINT</code> in the Impala table. The Parquet + values represent the time in milliseconds, while Impala interprets + <code class="ph codeph">BIGINT</code> as the time in seconds. Therefore, if you have + a <code class="ph codeph">BIGINT</code> column in a Parquet table that was imported + this way from Sqoop, divide the values by 1000 when interpreting as the + <code class="ph codeph">TIMESTAMP</code> type.</p> + + <p class="p"> + If the data exists outside Impala and is in some other format, combine both of the preceding techniques. + First, use a <code class="ph codeph">LOAD DATA</code> or <code class="ph codeph">CREATE EXTERNAL TABLE ... LOCATION</code> statement to + bring the data into an Impala table that uses the appropriate file format. Then, use an + <code class="ph codeph">INSERT...SELECT</code> statement to copy the data to the Parquet table, converting to Parquet + format as part of the process. + </p> + + + + <p class="p"> + Loading data into Parquet tables is a memory-intensive operation, because the incoming data is buffered + until it reaches <span class="ph">one data block</span> in size, then that chunk of data is + organized and compressed in memory before being written out. The memory consumption can be larger when + inserting data into partitioned Parquet tables, because a separate data file is written for each + combination of partition key column values, potentially requiring several + <span class="ph">large</span> chunks to be manipulated in memory at once. + </p> + + <p class="p"> + When inserting into a partitioned Parquet table, Impala redistributes the data among the nodes to reduce + memory consumption. You might still need to temporarily increase the memory dedicated to Impala during the + insert operation, or break up the load operation into several <code class="ph codeph">INSERT</code> statements, or both. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + All the preceding techniques assume that the data you are loading matches the structure of the destination + table, including column order, column names, and partition layout. To transform or reorganize the data, + start by loading the data into a Parquet table that matches the underlying structure of the data, then use + one of the table-copying techniques such as <code class="ph codeph">CREATE TABLE AS SELECT</code> or <code class="ph codeph">INSERT ... + SELECT</code> to reorder or rename columns, divide the data among multiple partitions, and so on. For + example to take a single comprehensive Parquet data file and load it into a partitioned table, you would + use an <code class="ph codeph">INSERT ... SELECT</code> statement with dynamic partitioning to let Impala create separate + data files with the appropriate partition values; for an example, see + <a class="xref" href="impala_insert.html#insert">INSERT Statement</a>. 
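+
+ <p class="p">
+ As a hypothetical sketch of that last technique (the table names are placeholders), the
+ following statement redistributes rows from an unpartitioned Parquet staging table into a
+ partitioned Parquet table, using dynamic partitioning:
+ </p>
+
+<pre class="pre codeblock"><code>-- WEB_STATS_STAGED is unpartitioned; WEB_STATS_PART is partitioned
+-- by YEAR, MONTH, and DAY. The partition key columns go last in the
+-- SELECT list, in the same order as in the PARTITION clause.
+INSERT INTO web_stats_part PARTITION (year, month, day)
+  SELECT url, http_code, response_time, year, month, day
+  FROM web_stats_staged;
+</code></pre>
+
+ <p class="p">
+ With this approach, Impala creates a separate set of data files for each distinct
+ (<code class="ph codeph">year</code>, <code class="ph codeph">month</code>, <code class="ph codeph">day</code>) combination.
+ </p>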
+ </div> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="parquet__parquet_performance"> + + <h2 class="title topictitle2" id="ariaid-title4">Query Performance for Impala Parquet Tables</h2> + + + <div class="body conbody"> + + <p class="p"> + Query performance for Parquet tables depends on the number of columns needed to process the + <code class="ph codeph">SELECT</code> list and <code class="ph codeph">WHERE</code> clauses of the query, the way data is divided into + <span class="ph">large data files with block size equal to file size</span>, the reduction in I/O + by reading the data for each column in compressed format, which data files can be skipped (for partitioned + tables), and the CPU overhead of decompressing the data for each column. + </p> + + <div class="p"> + For example, the following is an efficient query for a Parquet table: +<pre class="pre codeblock"><code>select avg(income) from census_data where state = 'CA';</code></pre> + The query processes only 2 columns out of a large number of total columns. If the table is partitioned by + the <code class="ph codeph">STATE</code> column, it is even more efficient because the query only has to read and decode + 1 column from each data file, and it can read only the data files in the partition directory for the state + <code class="ph codeph">'CA'</code>, skipping the data files for all the other states, which will be physically located + in other directories. + </div> + + <div class="p"> + The following is a relatively inefficient query for a Parquet table: +<pre class="pre codeblock"><code>select * from census_data;</code></pre> + Impala would have to read the entire contents of each <span class="ph">large</span> data file, + and decompress the contents of each column for each row group, negating the I/O optimizations of the + column-oriented format. This query might still be faster for a Parquet table than a table with some other + file format, but it does not take advantage of the unique strengths of Parquet data files. + </div> + + <p class="p"> + Impala can optimize queries on Parquet tables, especially join queries, better when statistics are + available for all the tables. Issue the <code class="ph codeph">COMPUTE STATS</code> statement for each table after + substantial amounts of data are loaded into or appended to it. See + <a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE STATS Statement</a> for details. + </p> + + <p class="p"> + The runtime filtering feature, available in <span class="keyword">Impala 2.5</span> and higher, works best with Parquet tables. + The per-row filtering aspect only applies to Parquet tables. + See <a class="xref" href="impala_runtime_filtering.html#runtime_filtering">Runtime Filtering for Impala Queries (Impala 2.5 or higher only)</a> for details. + </p> + + <p class="p"> + In <span class="keyword">Impala 2.6</span> and higher, Impala queries are optimized for files stored in Amazon S3. + For Impala tables that use the file formats Parquet, RCFile, SequenceFile, + Avro, and uncompressed text, the setting <code class="ph codeph">fs.s3a.block.size</code> + in the <span class="ph filepath">core-site.xml</span> configuration file determines + how Impala divides the I/O work of reading the data files. This configuration + setting is specified in bytes. By default, this + value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files + as if they were made up of 32 MB blocks. 
For example, if your S3 queries primarily access + Parquet files written by MapReduce or Hive, increase <code class="ph codeph">fs.s3a.block.size</code> + to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve + Parquet files written by Impala, increase <code class="ph codeph">fs.s3a.block.size</code> + to 268435456 (256 MB) to match the row group size produced by Impala. + </p> + + </div> + + <article class="topic concept nested2" aria-labelledby="ariaid-title5" id="parquet_performance__parquet_partitioning"> + + <h3 class="title topictitle3" id="ariaid-title5">Partitioning for Parquet Tables</h3> + + <div class="body conbody"> + + <p class="p"> + As explained in <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a>, partitioning is an important + performance technique for Impala generally. This section explains some of the performance considerations + for partitioned Parquet tables. + </p> + + <p class="p"> + The Parquet file format is ideal for tables containing many columns, where most queries only refer to a + small subset of the columns. As explained in <a class="xref" href="#parquet_data_files">How Parquet Data Files Are Organized</a>, the physical layout of + Parquet data files lets Impala read only a small fraction of the data for many queries. The performance + benefits of this approach are amplified when you use Parquet tables in combination with partitioning. + Impala can skip the data files for certain partitions entirely, based on the comparisons in the + <code class="ph codeph">WHERE</code> clause that refer to the partition key columns. For example, queries on + partitioned tables often analyze data for time intervals based on columns such as <code class="ph codeph">YEAR</code>, + <code class="ph codeph">MONTH</code>, and/or <code class="ph codeph">DAY</code>, or for geographic regions. Remember that Parquet + data files use a <span class="ph">large</span> block size, so when deciding how finely to + partition the data, try to find a granularity where each partition contains + <span class="ph">256 MB</span> or more of data, rather than creating a large number of smaller + files split among many partitions. + </p> + + <p class="p"> + Inserting into a partitioned Parquet table can be a resource-intensive operation, because each Impala + node could potentially be writing a separate data file to HDFS for each combination of different values + for the partition key columns. The large number of simultaneous open files could exceed the HDFS + <span class="q">"transceivers"</span> limit. To avoid exceeding this limit, consider the following techniques: + </p> + + <ul class="ul"> + <li class="li"> + Load different subsets of data using separate <code class="ph codeph">INSERT</code> statements with specific values + for the <code class="ph codeph">PARTITION</code> clause, such as <code class="ph codeph">PARTITION (year=2010)</code>. + </li> + + <li class="li"> + Increase the <span class="q">"transceivers"</span> value for HDFS, sometimes spelled <span class="q">"xcievers"</span> (sic). The property + value in the <span class="ph filepath">hdfs-site.xml</span> configuration file is + + <code class="ph codeph">dfs.datanode.max.transfer.threads</code>. For example, if you were loading 12 years of data + partitioned by year, month, and day, even a value of 4096 might not be high enough. 
This + <a class="xref" href="http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/" target="_blank">blog post</a> explores the considerations for setting this value + higher or lower, using HBase examples for illustration. + </li> + + <li class="li"> + Use the <code class="ph codeph">COMPUTE STATS</code> statement to collect + <a class="xref" href="impala_perf_stats.html#perf_column_stats">column statistics</a> on the source table from + which data is being copied, so that the Impala query can estimate the number of different values in the + partition key columns and distribute the work accordingly. + </li> + </ul> + + </div> + + </article> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="parquet__parquet_compression"> + + <h2 class="title topictitle2" id="ariaid-title6">Snappy and GZip Compression for Parquet Data Files</h2> + + + <div class="body conbody"> + + <p class="p"> + + When Impala writes Parquet data files using the <code class="ph codeph">INSERT</code> statement, the underlying + compression is controlled by the <code class="ph codeph">COMPRESSION_CODEC</code> query option. (Prior to Impala 2.0, the + query option name was <code class="ph codeph">PARQUET_COMPRESSION_CODEC</code>.) The allowed values for this query option + are <code class="ph codeph">snappy</code> (the default), <code class="ph codeph">gzip</code>, and <code class="ph codeph">none</code>. The option + value is not case-sensitive. If the option is set to an unrecognized value, all kinds of queries will fail + due to the invalid option setting, not just queries involving Parquet tables. + </p> + + </div> + + <article class="topic concept nested2" aria-labelledby="ariaid-title7" id="parquet_compression__parquet_snappy"> + + <h3 class="title topictitle3" id="ariaid-title7">Example of Parquet Table with Snappy Compression</h3> + + <div class="body conbody"> + + <p class="p"> + + By default, the underlying data files for a Parquet table are compressed with Snappy. The combination of + fast compression and decompression makes it a good choice for many data sets. 
To ensure Snappy + compression is used, for example after experimenting with other compression codecs, set the + <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">snappy</code> before inserting the data: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create database parquet_compression; +[localhost:21000] > use parquet_compression; +[localhost:21000] > create table parquet_snappy like raw_text_data; +[localhost:21000] > set COMPRESSION_CODEC=snappy; +[localhost:21000] > insert into parquet_snappy select * from raw_text_data; +Inserted 1000000000 rows in 181.98s +</code></pre> + + </div> + + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title8" id="parquet_compression__parquet_gzip"> + + <h3 class="title topictitle3" id="ariaid-title8">Example of Parquet Table with GZip Compression</h3> + + <div class="body conbody"> + + <p class="p"> + If you need more intensive compression (at the expense of more CPU cycles for uncompressing during + queries), set the <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">gzip</code> before + inserting the data: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table parquet_gzip like raw_text_data; +[localhost:21000] > set COMPRESSION_CODEC=gzip; +[localhost:21000] > insert into parquet_gzip select * from raw_text_data; +Inserted 1000000000 rows in 1418.24s +</code></pre> + + </div> + + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title9" id="parquet_compression__parquet_none"> + + <h3 class="title topictitle3" id="ariaid-title9">Example of Uncompressed Parquet Table</h3> + + <div class="body conbody"> + + <p class="p"> + If your data compresses very poorly, or you want to avoid the CPU overhead of compression and + decompression entirely, set the <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">none</code> + before inserting the data: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table parquet_none like raw_text_data; +[localhost:21000] > set COMPRESSION_CODEC=none; +[localhost:21000] > insert into parquet_none select * from raw_text_data; +Inserted 1000000000 rows in 146.90s +</code></pre> + + </div> + + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title10" id="parquet_compression__parquet_compression_examples"> + + <h3 class="title topictitle3" id="ariaid-title10">Examples of Sizes and Speeds for Compressed Parquet Tables</h3> + + <div class="body conbody"> + + <p class="p"> + Here are some examples showing differences in data sizes and query speeds for 1 billion rows of synthetic + data, compressed with each kind of codec. As always, run similar tests with realistic data sets of your + own. The actual compression ratios, and relative insert and query speeds, will vary depending on the + characteristics of the actual data. 
+ </p> + + <p class="p"> + In this case, switching from Snappy to GZip compression shrinks the data by an additional 40% or so, + while switching from Snappy compression to no compression expands the data also by about 40%: + </p> + +<pre class="pre codeblock"><code>$ hdfs dfs -du -h /user/hive/warehouse/parquet_compression.db +23.1 G /user/hive/warehouse/parquet_compression.db/parquet_snappy +13.5 G /user/hive/warehouse/parquet_compression.db/parquet_gzip +32.8 G /user/hive/warehouse/parquet_compression.db/parquet_none +</code></pre> + + <p class="p"> + Because Parquet data files are typically <span class="ph">large</span>, each directory will + have a different number of data files and the row groups will be arranged differently. + </p> + + <p class="p"> + At the same time, the less agressive the compression, the faster the data can be decompressed. In this + case using a table with a billion rows, a query that evaluates all the values for a particular column + runs faster with no compression than with Snappy compression, and faster with Snappy compression than + with Gzip compression. Query performance depends on several other factors, so as always, run your own + benchmarks with your own data to determine the ideal tradeoff between data size, CPU efficiency, and + speed of insert and query operations. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > desc parquet_snappy; +Query finished, fetching results ... ++-----------+---------+---------+ +| name | type | comment | ++-----------+---------+---------+ +| id | int | | +| val | int | | +| zfill | string | | +| name | string | | +| assertion | boolean | | ++-----------+---------+---------+ +Returned 5 row(s) in 0.14s +[localhost:21000] > select avg(val) from parquet_snappy; +Query finished, fetching results ... ++-----------------+ +| _c0 | ++-----------------+ +| 250000.93577915 | ++-----------------+ +Returned 1 row(s) in 4.29s +[localhost:21000] > select avg(val) from parquet_gzip; +Query finished, fetching results ... ++-----------------+ +| _c0 | ++-----------------+ +| 250000.93577915 | ++-----------------+ +Returned 1 row(s) in 6.97s +[localhost:21000] > select avg(val) from parquet_none; +Query finished, fetching results ... ++-----------------+ +| _c0 | ++-----------------+ +| 250000.93577915 | ++-----------------+ +Returned 1 row(s) in 3.67s +</code></pre> + + </div> + + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title11" id="parquet_compression__parquet_compression_multiple"> + + <h3 class="title topictitle3" id="ariaid-title11">Example of Copying Parquet Data Files</h3> + + <div class="body conbody"> + + <p class="p"> + Here is a final example, to illustrate how the data files using the various compression codecs are all + compatible with each other for read operations. The metadata about the compression format is written into + each data file, and can be decoded during queries regardless of the <code class="ph codeph">COMPRESSION_CODEC</code> + setting in effect at the time. In this example, we copy data files from the + <code class="ph codeph">PARQUET_SNAPPY</code>, <code class="ph codeph">PARQUET_GZIP</code>, and <code class="ph codeph">PARQUET_NONE</code> tables + used in the previous examples, each containing 1 billion rows, all to the data directory of a new table + <code class="ph codeph">PARQUET_EVERYTHING</code>. A couple of sample queries demonstrate that the new table now + contains 3 billion rows featuring a variety of compression codecs for the data files. 
+ </p> + + <p class="p"> + First, we create the table in Impala so that there is a destination directory in HDFS to put the data + files: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table parquet_everything like parquet_snappy; +Query: create table parquet_everything like parquet_snappy +</code></pre> + + <p class="p"> + Then in the shell, we copy the relevant data files into the data directory for this new table. Rather + than using <code class="ph codeph">hdfs dfs -cp</code> as with typical files, we use <code class="ph codeph">hadoop distcp -pb</code> + to ensure that the special <span class="ph"> block size</span> of the Parquet data files is + preserved. + </p> + +<pre class="pre codeblock"><code>$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_snappy \ + /user/hive/warehouse/parquet_compression.db/parquet_everything +...<var class="keyword varname">MapReduce output</var>... +$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_gzip \ + /user/hive/warehouse/parquet_compression.db/parquet_everything +...<var class="keyword varname">MapReduce output</var>... +$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_none \ + /user/hive/warehouse/parquet_compression.db/parquet_everything +...<var class="keyword varname">MapReduce output</var>... +</code></pre> + + <p class="p"> + Back in the <span class="keyword cmdname">impala-shell</span> interpreter, we use the <code class="ph codeph">REFRESH</code> statement to + alert the Impala server to the new data files for this table, then we can run queries demonstrating that + the data files represent 3 billion rows, and the values for one of the numeric columns match what was in + the original smaller tables: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > refresh parquet_everything; +Query finished, fetching results ... + +Returned 0 row(s) in 0.32s +[localhost:21000] > select count(*) from parquet_everything; +Query finished, fetching results ... ++------------+ +| _c0 | ++------------+ +| 3000000000 | ++------------+ +Returned 1 row(s) in 8.18s +[localhost:21000] > select avg(val) from parquet_everything; +Query finished, fetching results ... ++-----------------+ +| _c0 | ++-----------------+ +| 250000.93577915 | ++-----------------+ +Returned 1 row(s) in 13.35s +</code></pre> + + </div> + + </article> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title12" id="parquet__parquet_complex_types"> + + <h2 class="title topictitle2" id="ariaid-title12">Parquet Tables for Impala Complex Types</h2> + + <div class="body conbody"> + + <p class="p"> + In <span class="keyword">Impala 2.3</span> and higher, Impala supports the complex types + <code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code> + See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details. + Because these data types are currently supported only for the Parquet file format, + if you plan to use them, become familiar with the performance and storage aspects + of Parquet first. 
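+ </p>
+
+ <p class="p">
+ For illustration only, a hypothetical Parquet table using these types, and a query that reads
+ elements of the <code class="ph codeph">ARRAY</code> column, might look like the following. See the linked
+ topic for the full query syntax:
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical schema combining ARRAY, STRUCT, and MAP columns.
+CREATE TABLE customers_nested
+(
+  id BIGINT,
+  name STRING,
+  phones ARRAY<STRING>,
+  address STRUCT<street: STRING, city: STRING, zip: STRING>,
+  preferences MAP<STRING,STRING>
+)
+STORED AS PARQUET;
+
+-- Elements of the ARRAY column are read through a join-style reference.
+SELECT c.name, p.item AS phone
+  FROM customers_nested c, c.phones p;
+</code></pre>
+
+ <p class="p">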
+ </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title13" id="parquet__parquet_interop"> + + <h2 class="title topictitle2" id="ariaid-title13">Exchanging Parquet Data Files with Other Hadoop Components</h2> + + + <div class="body conbody"> + + <p class="p"> + You can read and write Parquet data files from other <span class="keyword"></span> components. + See <span class="xref">the documentation for your Apache Hadoop distribution</span> for details. + </p> + + + + + + + + + + <p class="p"> + Previously, it was not possible to create Parquet data through Impala and reuse that table within Hive. Now + that Parquet support is available for Hive, reusing existing Impala Parquet data files in Hive + requires updating the table metadata. Use the following command if you are already running Impala 1.1.1 or + higher: + </p> + +<pre class="pre codeblock"><code>ALTER TABLE <var class="keyword varname">table_name</var> SET FILEFORMAT PARQUET; +</code></pre> + + <p class="p"> + If you are running a level of Impala that is older than 1.1.1, do the metadata update through Hive: + </p> + +<pre class="pre codeblock"><code>ALTER TABLE <var class="keyword varname">table_name</var> SET SERDE 'parquet.hive.serde.ParquetHiveSerDe'; +ALTER TABLE <var class="keyword varname">table_name</var> SET FILEFORMAT + INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat" + OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"; +</code></pre> + + <p class="p"> + Impala 1.1.1 and higher can reuse Parquet data files created by Hive, without any action required. + </p> + + + + <p class="p"> + Impala supports the scalar data types that you can encode in a Parquet data file, but not composite or + nested types such as maps or arrays. In <span class="keyword">Impala 2.2</span> and higher, Impala can query Parquet data + files that include composite or nested types, as long as the query only refers to columns with scalar + types. + + </p> + + <p class="p"> + If you copy Parquet data files between nodes, or even between different directories on the same node, make + sure to preserve the block size by using the command <code class="ph codeph">hadoop distcp -pb</code>. To verify that the + block size was preserved, issue the command <code class="ph codeph">hdfs fsck -blocks + <var class="keyword varname">HDFS_path_of_impala_table_dir</var></code> and check that the average block size is at or + near <span class="ph">256 MB (or whatever other size is defined by the + <code class="ph codeph">PARQUET_FILE_SIZE</code> query option).</span>. (The <code class="ph codeph">hadoop distcp</code> operation + typically leaves some directories behind, with names matching <span class="ph filepath">_distcp_logs_*</span>, that you + can delete from the destination directory afterward.) + + + + Issue the command <span class="keyword cmdname">hadoop distcp</span> for details about <span class="keyword cmdname">distcp</span> command + syntax. + </p> + + + + <p class="p"> + Impala can query Parquet files that use the <code class="ph codeph">PLAIN</code>, <code class="ph codeph">PLAIN_DICTIONARY</code>, + <code class="ph codeph">BIT_PACKED</code>, and <code class="ph codeph">RLE</code> encodings. + Currently, Impala does not support <code class="ph codeph">RLE_DICTIONARY</code> encoding. + When creating files outside of Impala for use by Impala, make sure to use one of the supported encodings. 
+ In particular, for MapReduce jobs, <code class="ph codeph">parquet.writer.version</code> must not be defined + (especially as <code class="ph codeph">PARQUET_2_0</code>) for writing the configurations of Parquet MR jobs. + Use the default version (or format). The default format, 1.0, includes some enhancements that are compatible with older versions. + Data using the 2.0 format might not be consumable by Impala, due to use of the <code class="ph codeph">RLE_DICTIONARY</code> encoding. + </p> + <div class="p"> + To examine the internal structure and data of Parquet files, you can use the + <span class="keyword cmdname">parquet-tools</span> command. Make sure this + command is in your <code class="ph codeph">$PATH</code>. (Typically, it is symlinked from + <span class="ph filepath">/usr/bin</span>; sometimes, depending on your installation setup, you + might need to locate it under an alternative <code class="ph codeph">bin</code> directory.) + The arguments to this command let you perform operations such as: + <ul class="ul"> + <li class="li"> + <code class="ph codeph">cat</code>: Print a file's contents to standard out. In <span class="keyword">Impala 2.3</span> and higher, you can use + the <code class="ph codeph">-j</code> option to output JSON. + </li> + <li class="li"> + <code class="ph codeph">head</code>: Print the first few records of a file to standard output. + </li> + <li class="li"> + <code class="ph codeph">schema</code>: Print the Parquet schema for the file. + </li> + <li class="li"> + <code class="ph codeph">meta</code>: Print the file footer metadata, including key-value properties (like Avro schema), compression ratios, + encodings, compression used, and row group information. + </li> + <li class="li"> + <code class="ph codeph">dump</code>: Print all data and metadata. + </li> + </ul> + Use <code class="ph codeph">parquet-tools -h</code> to see usage information for all the arguments. + Here are some examples showing <span class="keyword cmdname">parquet-tools</span> usage: + +<pre class="pre codeblock"><code> +$ # Be careful doing this for a big file! Use parquet-tools head to be safe. +$ parquet-tools cat sample.parq +year = 1992 +month = 1 +day = 2 +dayofweek = 4 +dep_time = 748 +crs_dep_time = 750 +arr_time = 851 +crs_arr_time = 846 +carrier = US +flight_num = 53 +actual_elapsed_time = 63 +crs_elapsed_time = 56 +arrdelay = 5 +depdelay = -2 +origin = CMH +dest = IND +distance = 182 +cancelled = 0 +diverted = 0 + +year = 1992 +month = 1 +day = 3 +... + +</code></pre> + +<pre class="pre codeblock"><code> +$ parquet-tools head -n 2 sample.parq +year = 1992 +month = 1 +day = 2 +dayofweek = 4 +dep_time = 748 +crs_dep_time = 750 +arr_time = 851 +crs_arr_time = 846 +carrier = US +flight_num = 53 +actual_elapsed_time = 63 +crs_elapsed_time = 56 +arrdelay = 5 +depdelay = -2 +origin = CMH +dest = IND +distance = 182 +cancelled = 0 +diverted = 0 + +year = 1992 +month = 1 +day = 3 +... + +</code></pre> + +<pre class="pre codeblock"><code> +$ parquet-tools schema sample.parq +message schema { + optional int32 year; + optional int32 month; + optional int32 day; + optional int32 dayofweek; + optional int32 dep_time; + optional int32 crs_dep_time; + optional int32 arr_time; + optional int32 crs_arr_time; + optional binary carrier; + optional int32 flight_num; +... + +</code></pre> + +<pre class="pre codeblock"><code> +$ parquet-tools meta sample.parq +creator: impala version 2.2.0-... 
+ +file schema: schema +------------------------------------------------------------------- +year: OPTIONAL INT32 R:0 D:1 +month: OPTIONAL INT32 R:0 D:1 +day: OPTIONAL INT32 R:0 D:1 +dayofweek: OPTIONAL INT32 R:0 D:1 +dep_time: OPTIONAL INT32 R:0 D:1 +crs_dep_time: OPTIONAL INT32 R:0 D:1 +arr_time: OPTIONAL INT32 R:0 D:1 +crs_arr_time: OPTIONAL INT32 R:0 D:1 +carrier: OPTIONAL BINARY R:0 D:1 +flight_num: OPTIONAL INT32 R:0 D:1 +... + +row group 1: RC:20636601 TS:265103674 +------------------------------------------------------------------- +year: INT32 SNAPPY DO:4 FPO:35 SZ:10103/49723/4.92 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +month: INT32 SNAPPY DO:10147 FPO:10210 SZ:11380/35732/3.14 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +day: INT32 SNAPPY DO:21572 FPO:21714 SZ:3071658/9868452/3.21 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +dayofweek: INT32 SNAPPY DO:3093276 FPO:3093319 SZ:2274375/5941876/2.61 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +dep_time: INT32 SNAPPY DO:5367705 FPO:5373967 SZ:28281281/28573175/1.01 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +crs_dep_time: INT32 SNAPPY DO:33649039 FPO:33654262 SZ:10220839/11574964/1.13 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +arr_time: INT32 SNAPPY DO:43869935 FPO:43876489 SZ:28562410/28797767/1.01 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +crs_arr_time: INT32 SNAPPY DO:72432398 FPO:72438151 SZ:10908972/12164626/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +carrier: BINARY SNAPPY DO:83341427 FPO:83341558 SZ:114916/128611/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +flight_num: INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +... + +</code></pre> + </div> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title14" id="parquet__parquet_data_files"> + + <h2 class="title topictitle2" id="ariaid-title14">How Parquet Data Files Are Organized</h2> + + + <div class="body conbody"> + + <p class="p"> + Although Parquet is a column-oriented file format, do not expect to find one data file for each column. + Parquet keeps all the data for a row within the same data file, to ensure that the columns for a row are + always available on the same node for processing. What Parquet does is to set a large HDFS block size and a + matching maximum data file size, to ensure that I/O and network transfer requests apply to large batches of + data. + </p> + + <p class="p"> + Within that data file, the data for a set of rows is rearranged so that all the values from the first + column are organized in one contiguous block, then all the values from the second column, and so on. + Putting the values from the same column next to each other lets Impala use effective compression techniques + on the values in that column. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + Impala <code class="ph codeph">INSERT</code> statements write Parquet data files using an HDFS block size + <span class="ph">that matches the data file size</span>, to ensure that each data file is + represented by a single HDFS block, and the entire file can be processed on a single node without + requiring any remote reads. + </p> + + <p class="p"> + If you create Parquet data files outside of Impala, such as through a MapReduce or Pig job, ensure that + the HDFS block size is greater than or equal to the file size, so that the <span class="q">"one file per block"</span> + relationship is maintained. 
Set the <code class="ph codeph">dfs.block.size</code> or the <code class="ph codeph">dfs.blocksize</code> + property large enough that each file fits within a single HDFS block, even if that size is larger than + the normal HDFS block size. + </p> + + <p class="p"> + If the block size is reset to a lower value during a file copy, you will see lower performance for + queries involving those files, and the <code class="ph codeph">PROFILE</code> statement will reveal that some I/O is + being done suboptimally, through remote reads. See + <a class="xref" href="impala_parquet.html#parquet_compression_multiple">Example of Copying Parquet Data Files</a> for an example showing how to preserve the + block size when copying Parquet data files. + </p> + </div> + + <p class="p"> + When Impala retrieves or tests the data for a particular column, it opens all the data files, but only + reads the portion of each file containing the values for that column. The column values are stored + consecutively, minimizing the I/O required to process the values within a single column. If other columns + are named in the <code class="ph codeph">SELECT</code> list or <code class="ph codeph">WHERE</code> clauses, the data for all columns + in the same row is available within that same data file. + </p> + + <p class="p"> + If an <code class="ph codeph">INSERT</code> statement brings in less than <span class="ph">one Parquet + block's worth</span> of data, the resulting data file is smaller than ideal. Thus, if you do split up an ETL + job to use multiple <code class="ph codeph">INSERT</code> statements, try to keep the volume of data for each + <code class="ph codeph">INSERT</code> statement to approximately <span class="ph">256 MB, or a multiple of + 256 MB</span>. + </p> + + </div> + + <article class="topic concept nested2" aria-labelledby="ariaid-title15" id="parquet_data_files__parquet_encoding"> + + <h3 class="title topictitle3" id="ariaid-title15">RLE and Dictionary Encoding for Parquet Data Files</h3> + + <div class="body conbody"> + + <p class="p"> + Parquet uses some automatic compression techniques, such as run-length encoding (RLE) and dictionary + encoding, based on analysis of the actual data values. Once the data values are encoded in a compact + form, the encoded data can optionally be further compressed using a compression algorithm. Parquet data + files created by Impala can use Snappy, GZip, or no compression; the Parquet spec also allows LZO + compression, but currently Impala does not support LZO-compressed Parquet files. + </p> + + <p class="p"> + RLE and dictionary encoding are compression techniques that Impala applies automatically to groups of + Parquet data values, in addition to any Snappy or GZip compression applied to the entire data files. + These automatic optimizations can save you time and planning that are normally needed for a traditional + data warehouse. For example, dictionary encoding reduces the need to create numeric IDs as abbreviations + for longer string values. + </p> + + <p class="p"> + Run-length encoding condenses sequences of repeated data values. For example, if many consecutive rows + all contain the same value for a country code, those repeating values can be represented by the value + followed by a count of how many times it appears consecutively. + </p> + + <p class="p"> + Dictionary encoding takes the different values present in a column, and represents each one in compact + 2-byte form rather than the original value, which could be several bytes. 
(Additional compression is
+ applied to the compacted values, for extra space savings.) This type of encoding applies when the number
+ of different values for a column is less than 2**16 (65,536). It does not apply to columns of data type
+ <code class="ph codeph">BOOLEAN</code>, which are already very short. <code class="ph codeph">TIMESTAMP</code> columns sometimes have
+ a unique value for each row, in which case they can quickly exceed the 2**16 limit on distinct values.
+ The 2**16 limit on different values within a column is reset for each data file, so if several different
+ data files each contained 10,000 different city names, the city name column in each data file could still
+ be condensed using dictionary encoding.
+ </p>
+
+ </div>
+
+ </article>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title16" id="parquet__parquet_compacting">
+
+ <h2 class="title topictitle2" id="ariaid-title16">Compacting Data Files for Parquet Tables</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ If you reuse existing table structures or ETL processes for Parquet tables, you might encounter a <span class="q">"many
+ small files"</span> situation, which is suboptimal for query efficiency. For example, statements like these
+ might produce inefficiently organized data files:
+ </p>
+
+<pre class="pre codeblock"><code>-- In an N-node cluster, each node produces a data file
+-- for the INSERT operation. If you have less than
+-- N GB of data to copy, some files are likely to be
+-- much smaller than the <span class="ph">default Parquet</span> block size.
+insert into parquet_table select * from text_table;
+
+-- Even if this operation involves an overall large amount of data,
+-- when split up by year/month/day, each partition might only
+-- receive a small amount of data. Then the data files for
+-- the partition might be divided between the N nodes in the cluster.
+-- A multi-gigabyte copy operation might produce files of only
+-- a few MB each.
+insert into partitioned_parquet_table partition (year, month, day)
+  select year, month, day, url, referer, user_agent, http_code, response_time
+  from web_stats;
+</code></pre>
+
+ <p class="p">
+ Here are techniques to help you produce large data files in Parquet <code class="ph codeph">INSERT</code> operations, and
+ to compact existing too-small data files:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ When inserting into a partitioned Parquet table, use statically partitioned <code class="ph codeph">INSERT</code>
+ statements where the partition key values are specified as constant values. Ideally, use a separate
+ <code class="ph codeph">INSERT</code> statement for each partition.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ You might set the <code class="ph codeph">NUM_NODES</code> option to 1 briefly, during <code class="ph codeph">INSERT</code> or
+ <code class="ph codeph">CREATE TABLE AS SELECT</code> statements. Normally, those statements produce one or more data
+ files per data node. If the write operation involves small amounts of data, a Parquet table, and/or a
+ partitioned table, the default behavior could produce many small files when intuitively you might expect
+ only a single output file. <code class="ph codeph">SET NUM_NODES=1</code> turns off the <span class="q">"distributed"</span> aspect of the
+ write operation, making it more likely to produce only one or a few data files.
+ </p> + </li> + + <li class="li"> + <p class="p"> + Be prepared to reduce the number of partition key columns from what you are used to with traditional + analytic database systems. + </p> + </li> + + <li class="li"> + <p class="p"> + Do not expect Impala-written Parquet files to fill up the entire Parquet block size. Impala estimates + on the conservative side when figuring out how much data to write to each Parquet file. Typically, the + of uncompressed data in memory is substantially reduced on disk by the compression and encoding + techniques in the Parquet file format. + + The final data file size varies depending on the compressibility of the data. Therefore, it is not an + indication of a problem if <span class="ph">256 MB</span> of text data is turned into 2 + Parquet data files, each less than <span class="ph">256 MB</span>. + </p> + </li> + + <li class="li"> + <p class="p"> + If you accidentally end up with a table with many small data files, consider using one or more of the + preceding techniques and copying all the data into a new Parquet table, either through <code class="ph codeph">CREATE + TABLE AS SELECT</code> or <code class="ph codeph">INSERT ... SELECT</code> statements. + </p> + + <p class="p"> + To avoid rewriting queries to change table names, you can adopt a convention of always running + important queries against a view. Changing the view definition immediately switches any subsequent + queries to use the new underlying tables: + </p> +<pre class="pre codeblock"><code>create view production_table as select * from table_with_many_small_files; +-- CTAS or INSERT...SELECT all the data into a more efficient layout... +alter view production_table as select * from table_with_few_big_files; +select * from production_table where c1 = 100 and c2 < 50 and ...; +</code></pre> + </li> + </ul> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title17" id="parquet__parquet_schema_evolution"> + + <h2 class="title topictitle2" id="ariaid-title17">Schema Evolution for Parquet Tables</h2> + + <div class="body conbody"> + + <p class="p"> + Schema evolution refers to using the statement <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> to change + the names, data type, or number of columns in a table. You can perform schema evolution for Parquet tables + as follows: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + The Impala <code class="ph codeph">ALTER TABLE</code> statement never changes any data files in the tables. From the + Impala side, schema evolution involves interpreting the same data files in terms of a new table + definition. Some types of schema changes make sense and are represented correctly. Other types of + changes cannot be represented in a sensible way, and produce special result values or conversion errors + during queries. + </p> + </li> + + <li class="li"> + <p class="p"> + The <code class="ph codeph">INSERT</code> statement always creates data using the latest table definition. You might + end up with data files with different numbers of columns or internal data representations if you do a + sequence of <code class="ph codeph">INSERT</code> and <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> statements. + </p> + </li> + + <li class="li"> + <p class="p"> + If you use <code class="ph codeph">ALTER TABLE ... 
REPLACE COLUMNS</code> to define additional columns at the end, + when the original data files are used in a query, these final columns are considered to be all + <code class="ph codeph">NULL</code> values. + </p> + </li> + + <li class="li"> + <p class="p"> + If you use <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> to define fewer columns than before, when + the original data files are used in a query, the unused columns still present in the data file are + ignored. + </p> + </li> + + <li class="li"> + <p class="p"> + Parquet represents the <code class="ph codeph">TINYINT</code>, <code class="ph codeph">SMALLINT</code>, and <code class="ph codeph">INT</code> + types the same internally, all stored in 32-bit integers. + </p> + <ul class="ul"> + <li class="li"> + That means it is easy to promote a <code class="ph codeph">TINYINT</code> column to <code class="ph codeph">SMALLINT</code> or + <code class="ph codeph">INT</code>, or a <code class="ph codeph">SMALLINT</code> column to <code class="ph codeph">INT</code>. The numbers are + represented exactly the same in the data file, and the columns being promoted would not contain any + out-of-range values. + </li> + + <li class="li"> + <p class="p"> + If you change any of these column types to a smaller type, any values that are out-of-range for the + new type are returned incorrectly, typically as negative numbers. + </p> + </li> + + <li class="li"> + <p class="p"> + You cannot change a <code class="ph codeph">TINYINT</code>, <code class="ph codeph">SMALLINT</code>, or <code class="ph codeph">INT</code> + column to <code class="ph codeph">BIGINT</code>, or the other way around. Although the <code class="ph codeph">ALTER + TABLE</code> succeeds, any attempt to query those columns results in conversion errors. + </p> + </li> + + <li class="li"> + <p class="p"> + Any other type conversion for columns produces a conversion error during queries. For example, + <code class="ph codeph">INT</code> to <code class="ph codeph">STRING</code>, <code class="ph codeph">FLOAT</code> to <code class="ph codeph">DOUBLE</code>, + <code class="ph codeph">TIMESTAMP</code> to <code class="ph codeph">STRING</code>, <code class="ph codeph">DECIMAL(9,0)</code> to + <code class="ph codeph">DECIMAL(5,2)</code>, and so on. + </p> + </li> + </ul> + </li> + </ul> + + <div class="p"> + You might find that you have Parquet files where the columns do not line up in the same + order as in your Impala table. For example, you might have a Parquet file that was part of + a table with columns <code class="ph codeph">C1,C2,C3,C4</code>, and now you want to reuse the same + Parquet file in a table with columns <code class="ph codeph">C4,C2</code>. By default, Impala expects the + columns in the data file to appear in the same order as the columns defined for the table, + making it impractical to do some kinds of file reuse or schema evolution. In <span class="keyword">Impala 2.6</span> + and higher, the query option <code class="ph codeph">PARQUET_FALLBACK_SCHEMA_RESOLUTION=name</code> lets Impala + resolve columns by name, and therefore handle out-of-order or extra columns in the data file. 
+ For example:
+
+<pre class="pre codeblock"><code>
+create database schema_evolution;
+use schema_evolution;
+create table t1 (c1 int, c2 boolean, c3 string, c4 timestamp)
+  stored as parquet;
+insert into t1 values
+  (1, true, 'yes', now()),
+  (2, false, 'no', now() + interval 1 day);
+
+select * from t1;
++----+-------+-----+-------------------------------+
+| c1 | c2    | c3  | c4                            |
++----+-------+-----+-------------------------------+
+| 1  | true  | yes | 2016-06-28 14:53:26.554369000 |
+| 2  | false | no  | 2016-06-29 14:53:26.554369000 |
++----+-------+-----+-------------------------------+
+
+desc formatted t1;
+...
+| Location:   | /user/hive/warehouse/schema_evolution.db/t1   |
+...
+
+-- T2 declares only two of T1's columns, in a different order.
+create table t2 (c4 timestamp, c2 boolean) stored as parquet;
+
+-- Make T2 have the same data file as in T1, including 2
+-- unused columns and column order different than T2 expects.
+load data inpath '/user/hive/warehouse/schema_evolution.db/t1'
+  into table t2;
++------------------------------------------------------------+
+| summary                                                    |
++------------------------------------------------------------+
+| Loaded 1 file(s). Total files in destination location: 1  |
++------------------------------------------------------------+
+
+-- 'position' is the default setting.
+-- Impala cannot read the Parquet file if the column order does not match.
+set PARQUET_FALLBACK_SCHEMA_RESOLUTION=position;
+PARQUET_FALLBACK_SCHEMA_RESOLUTION set to position
+
+select * from t2;
+WARNINGS:
+File 'schema_evolution.db/t2/45331705_data.0.parq'
+has an incompatible Parquet schema for column 'schema_evolution.t2.c4'.
+Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0]
+
+File 'schema_evolution.db/t2/45331705_data.0.parq'
+has an incompatible Parquet schema for column 'schema_evolution.t2.c4'.
+Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0]
+
+-- With the 'name' setting, Impala can read the Parquet data files
+-- despite mismatching column order.
+set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
+PARQUET_FALLBACK_SCHEMA_RESOLUTION set to name
+
+select * from t2;
++-------------------------------+-------+
+| c4                            | c2    |
++-------------------------------+-------+
+| 2016-06-28 14:53:26.554369000 | true  |
+| 2016-06-29 14:53:26.554369000 | false |
++-------------------------------+-------+
+
+</code></pre>
+
+ See <a class="xref" href="impala_parquet_fallback_schema_resolution.html#parquet_fallback_schema_resolution">PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (Impala 2.6 or higher only)</a>
+ for more details.
+ </div>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title18" id="parquet__parquet_data_types">
+
+ <h2 class="title topictitle2" id="ariaid-title18">Data Type Considerations for Parquet Tables</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The Parquet format defines a set of data types whose names differ from the names of the corresponding
+ Impala data types. If you are preparing Parquet files using other Hadoop components such as Pig or
+ MapReduce, you might need to work with the type names defined by Parquet. The following listings show the
+ Parquet-defined types and the equivalent types in Impala.
+ </p> + + <p class="p"> + <strong class="ph b">Primitive types:</strong> + </p> + +<pre class="pre codeblock"><code>BINARY -> STRING +BOOLEAN -> BOOLEAN +DOUBLE -> DOUBLE +FLOAT -> FLOAT +INT32 -> INT +INT64 -> BIGINT +INT96 -> TIMESTAMP +</code></pre> + + <p class="p"> + <strong class="ph b">Logical types:</strong> + </p> + +<pre class="pre codeblock"><code>BINARY + OriginalType UTF8 -> STRING +BINARY + OriginalType DECIMAL -> DECIMAL +</code></pre> + + <p class="p"> + <strong class="ph b">Complex types:</strong> + </p> + + <p class="p"> + For the complex types (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">MAP</code>, and <code class="ph codeph">STRUCT</code>) + available in <span class="keyword">Impala 2.3</span> and higher, Impala only supports queries + against those types in Parquet tables. + </p> + + </div> + + </article> + +</article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_parquet_annotate_strings_utf8.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_parquet_annotate_strings_utf8.html b/docs/build/html/topics/impala_parquet_annotate_strings_utf8.html new file mode 100644 index 0000000..6f6ed71 --- /dev/null +++ b/docs/build/html/topics/impala_parquet_annotate_strings_utf8.html @@ -0,0 +1,54 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_annotate_strings_utf8"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (Impala 2.6 or higher only)</title></head><body id="parquet_annotate_strings_utf8"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (<span class="keyword">Impala 2.6</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Causes Impala <code class="ph codeph">INSERT</code> and <code class="ph codeph">CREATE TABLE AS SELECT</code> statements + to write Parquet files that use the UTF-8 annotation for <code class="ph codeph">STRING</code> columns. + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + <p class="p"> + By default, Impala represents a <code class="ph codeph">STRING</code> column in Parquet as an unannotated binary field. + </p> + <p class="p"> + Impala always uses the UTF-8 annotation when writing <code class="ph codeph">CHAR</code> and <code class="ph codeph">VARCHAR</code> + columns to Parquet files. An alternative to using the query option is to cast <code class="ph codeph">STRING</code> + values to <code class="ph codeph">VARCHAR</code>. + </p> + <p class="p"> + This option is to help make Impala-written data more interoperable with other data processing engines. + Impala itself currently does not support all operations on UTF-8 data. + Although data processed by Impala is typically represented in ASCII, it is valid to designate the + data as UTF-8 when storing on disk, because ASCII is a subset of UTF-8. 
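+ </p>
+ <p class="p">
+ For example, the following is a minimal sketch (the table names are illustrative) that enables the
+ option before copying data from a text table, so that other engines reading the resulting Parquet files
+ see the <code class="ph codeph">STRING</code> columns annotated as UTF-8:
+ </p>
+<pre class="pre codeblock"><code>set parquet_annotate_strings_utf8=true;
+create table utf8_annotated_parquet stored as parquet
+  as select * from text_table;
+</code></pre>
+ <p class="p">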
+ </p> + <p class="p"> + <strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and <code class="ph codeph">false</code>; + any other value interpreted as <code class="ph codeph">false</code> + </p> + <p class="p"> + <strong class="ph b">Default:</strong> <code class="ph codeph">false</code> (shown as 0 in output of <code class="ph codeph">SET</code> statement) + </p> + + <p class="p"> + <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.6.0</span> + </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + <p class="p"> + <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a> + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_parquet_compression_codec.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_parquet_compression_codec.html b/docs/build/html/topics/impala_parquet_compression_codec.html new file mode 100644 index 0000000..34ae693 --- /dev/null +++ b/docs/build/html/topics/impala_parquet_compression_codec.html @@ -0,0 +1,17 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_compression_codec"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_COMPRESSION_CODEC Query Option</title></head><body id="parquet_compression_codec"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">PARQUET_COMPRESSION_CODEC Query Option</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Deprecated. Use <code class="ph codeph">COMPRESSION_CODEC</code> in Impala 2.0 and later. See + <a class="xref" href="impala_compression_codec.html#compression_codec">COMPRESSION_CODEC Query Option (Impala 2.0 or higher only)</a> for details. 
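+ </p>
+ <p class="p">
+ For example, a minimal sketch of the replacement option (the table names are illustrative):
+ </p>
+<pre class="pre codeblock"><code>-- Use COMPRESSION_CODEC instead of the deprecated PARQUET_COMPRESSION_CODEC.
+set compression_codec=gzip;
+insert into parquet_table select * from text_table;
+set compression_codec=snappy;  -- Snappy is the default codec.
+</code></pre>
+ <p class="p">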
+ </p> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_parquet_fallback_schema_resolution.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_parquet_fallback_schema_resolution.html b/docs/build/html/topics/impala_parquet_fallback_schema_resolution.html new file mode 100644 index 0000000..91abf35 --- /dev/null +++ b/docs/build/html/topics/impala_parquet_fallback_schema_resolution.html @@ -0,0 +1,46 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_fallback_schema_resolution"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (Impala 2.6 or higher only)</title></head><body id="parquet_fallback_schema_resolution"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (<span class="keyword">Impala 2.6</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Allows Impala to look up columns within Parquet files by column name, rather than column order, + when necessary. + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + <p class="p"> + By default, Impala looks up columns within a Parquet file based on + the order of columns in the table. + The <code class="ph codeph">name</code> setting for this option enables behavior for Impala queries + similar to the Hive setting <code class="ph codeph">parquet.column.index.access=false</code>. + It also allows Impala to query Parquet files created by Hive with the + <code class="ph codeph">parquet.column.index.access=false</code> setting in effect. + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> integer or string. + Allowed values are 0 or <code class="ph codeph">position</code> (default), 1 or <code class="ph codeph">name</code>. 
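+ </p>
+ <p class="p">
+ For example, the following minimal sketch switches the current session to name-based resolution before
+ querying a table whose data files have reordered or extra columns (reusing the <code class="ph codeph">T2</code>
+ table from the schema evolution example):
+ </p>
+<pre class="pre codeblock"><code>set parquet_fallback_schema_resolution=name;
+-- Columns are now matched to the Parquet data file by name rather than position.
+select * from t2;
+</code></pre>
+ <p class="p">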
+ </p> + + <p class="p"> + <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.6.0</span> + </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + <p class="p"> + <a class="xref" href="impala_parquet.html#parquet_schema_evolution">Schema Evolution for Parquet Tables</a> + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_parquet_file_size.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_parquet_file_size.html b/docs/build/html/topics/impala_parquet_file_size.html new file mode 100644 index 0000000..695c557 --- /dev/null +++ b/docs/build/html/topics/impala_parquet_file_size.html @@ -0,0 +1,93 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_file_size"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_FILE_SIZE Query Option</title></head><body id="parquet_file_size"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">PARQUET_FILE_SIZE Query Option</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Specifies the maximum size of each Parquet data file produced by Impala <code class="ph codeph">INSERT</code> statements. + </p> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + + <p class="p"> + Specify the size in bytes, or with a trailing <code class="ph codeph">m</code> or <code class="ph codeph">g</code> character to indicate + megabytes or gigabytes. For example: + </p> + +<pre class="pre codeblock"><code>-- 128 megabytes. +set PARQUET_FILE_SIZE=134217728 +INSERT OVERWRITE parquet_table SELECT * FROM text_table; + +-- 512 megabytes. +set PARQUET_FILE_SIZE=512m; +INSERT OVERWRITE parquet_table SELECT * FROM text_table; + +-- 1 gigabyte. +set PARQUET_FILE_SIZE=1g; +INSERT OVERWRITE parquet_table SELECT * FROM text_table; +</code></pre> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + With tables that are small or finely partitioned, the default Parquet block size (formerly 1 GB, now 256 MB + in Impala 2.0 and later) could be much larger than needed for each data file. For <code class="ph codeph">INSERT</code> + operations into such tables, you can increase parallelism by specifying a smaller + <code class="ph codeph">PARQUET_FILE_SIZE</code> value, resulting in more HDFS blocks that can be processed by different + nodes. 
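+ </p>
+ <p class="p">
+ For example, the following sketch (the size and table names are illustrative) writes smaller files for a
+ finely partitioned table so that more hosts can process the resulting data in parallel:
+ </p>
+<pre class="pre codeblock"><code>-- Aim for 64 MB data files rather than the 256 MB default.
+set PARQUET_FILE_SIZE=64m;
+insert overwrite partitioned_parquet_table partition (year, month, day)
+  select year, month, day, url, referer, user_agent, http_code, response_time
+  from web_stats;
+</code></pre>
+ <p class="p">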
+ + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> numeric, with optional unit specifier + </p> + + <div class="note important note_important"><span class="note__title importanttitle">Important:</span> + <p class="p"> + Currently, the maximum value for this setting is 1 gigabyte (<code class="ph codeph">1g</code>). + Setting a value higher than 1 gigabyte could result in errors during + an <code class="ph codeph">INSERT</code> operation. + </p> + </div> + + <p class="p"> + <strong class="ph b">Default:</strong> 0 (produces files with a target size of 256 MB; files might be larger for very wide tables) + </p> + + <p class="p"> + <strong class="ph b">Isilon considerations:</strong> + </p> + <div class="p"> + Because the EMC Isilon storage devices use a global value for the block size + rather than a configurable value for each file, the <code class="ph codeph">PARQUET_FILE_SIZE</code> + query option has no effect when Impala inserts data into a table or partition + residing on Isilon storage. Use the <code class="ph codeph">isi</code> command to set the + default block size globally on the Isilon device. For example, to set the + Isilon default block size to 256 MB, the recommended size for Parquet + data files for Impala, issue the following command: +<pre class="pre codeblock"><code>isi hdfs settings modify --default-block-size=256MB</code></pre> + </div> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + + <p class="p"> + For information about the Parquet file format, and how the number and size of data files affects query + performance, see <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a>. + </p> + + + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file
