http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_parquet.html
----------------------------------------------------------------------
diff --git a/docs/build/html/topics/impala_parquet.html b/docs/build/html/topics/impala_parquet.html
new file mode 100644
index 0000000..894c97a
--- /dev/null
+++ b/docs/build/html/topics/impala_parquet.html
@@ -0,0 +1,1392 @@
+<!DOCTYPE html
+ SYSTEM "about:legacy-compat">
+<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_file_formats.html"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>Using the Parquet File Format with Impala Tables</title></head><body id="parquet"><main role="main"><article role="article" aria-labelledby="ariaid-title1">
+
+ <h1 class="title topictitle1" id="ariaid-title1">Using the Parquet File Format with Impala Tables</h1>
+
+
+
+ <div class="body conbody">
+
+ <p class="p">
+
+ Impala helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format
+ intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is
+ especially good for queries scanning particular columns within a table, for example to query <span class="q">"wide"</span>
+ tables with many columns, or to perform aggregation operations such as <code class="ph codeph">SUM()</code> and
+ <code class="ph codeph">AVG()</code> that need to process most or all of the values from a column. Each data file contains
+ the values for a set of rows (the <span class="q">"row group"</span>).
Within a data file, the values from each column are + organized so that they are all adjacent, enabling good compression for the values from that column. Queries + against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O. + </p> + + <table class="table"><caption><span class="table--title-label">Table 1. </span><span class="title">Parquet Format Support in Impala</span></caption><colgroup><col style="width:10%"><col style="width:10%"><col style="width:20%"><col style="width:30%"><col style="width:30%"></colgroup><thead class="thead"> + <tr class="row"> + <th class="entry nocellnorowborder" id="parquet__entry__1"> + File Type + </th> + <th class="entry nocellnorowborder" id="parquet__entry__2"> + Format + </th> + <th class="entry nocellnorowborder" id="parquet__entry__3"> + Compression Codecs + </th> + <th class="entry nocellnorowborder" id="parquet__entry__4"> + Impala Can CREATE? + </th> + <th class="entry nocellnorowborder" id="parquet__entry__5"> + Impala Can INSERT? + </th> + </tr> + </thead><tbody class="tbody"> + <tr class="row"> + <td class="entry nocellnorowborder" headers="parquet__entry__1 "> + <a class="xref" href="impala_parquet.html#parquet">Parquet</a> + </td> + <td class="entry nocellnorowborder" headers="parquet__entry__2 "> + Structured + </td> + <td class="entry nocellnorowborder" headers="parquet__entry__3 "> + Snappy, gzip; currently Snappy by default + </td> + <td class="entry nocellnorowborder" headers="parquet__entry__4 "> + Yes. + </td> + <td class="entry nocellnorowborder" headers="parquet__entry__5 "> + Yes: <code class="ph codeph">CREATE TABLE</code>, <code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and query. + </td> + </tr> + </tbody></table> + + <p class="p toc inpage"></p> + + </div> + + + <nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_file_formats.html">How Impala Works with Hadoop File Formats</a></div></div></nav><article class="topic concept nested1" aria-labelledby="ariaid-title2" id="parquet__parquet_ddl"> + + <h2 class="title topictitle2" id="ariaid-title2">Creating Parquet Tables in Impala</h2> + + <div class="body conbody"> + + <p class="p"> + To create a table named <code class="ph codeph">PARQUET_TABLE</code> that uses the Parquet format, you would use a + command like the following, substituting your own table name, column names, and data types: + </p> + +<pre class="pre codeblock"><code>[impala-host:21000] > create table <var class="keyword varname">parquet_table_name</var> (x INT, y STRING) STORED AS PARQUET;</code></pre> + + + + <p class="p"> + Or, to clone the column names and data types of an existing table: + </p> + +<pre class="pre codeblock"><code>[impala-host:21000] > create table <var class="keyword varname">parquet_table_name</var> LIKE <var class="keyword varname">other_table_name</var> STORED AS PARQUET;</code></pre> + + <p class="p"> + In Impala 1.4.0 and higher, you can derive column definitions from a raw Parquet data file, even without an + existing Impala table. 
For example, you can create an external table pointing to an HDFS directory, and + base the column definitions on one of the files in that directory: + </p> + +<pre class="pre codeblock"><code>CREATE EXTERNAL TABLE ingest_existing_files LIKE PARQUET '/user/etl/destination/datafile1.dat' + STORED AS PARQUET + LOCATION '/user/etl/destination'; +</code></pre> + + <p class="p"> + Or, you can refer to an existing data file and create a new empty table with suitable column definitions. + Then you can use <code class="ph codeph">INSERT</code> to create new data files or <code class="ph codeph">LOAD DATA</code> to transfer + existing data files into the new table. + </p> + +<pre class="pre codeblock"><code>CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat' + STORED AS PARQUET; +</code></pre> + + <p class="p"> + The default properties of the newly created table are the same as for any other <code class="ph codeph">CREATE + TABLE</code> statement. For example, the default file format is text; if you want the new table to use + the Parquet file format, include the <code class="ph codeph">STORED AS PARQUET</code> file also. + </p> + + <p class="p"> + In this example, the new table is partitioned by year, month, and day. These partition key columns are not + part of the data file, so you specify them in the <code class="ph codeph">CREATE TABLE</code> statement: + </p> + +<pre class="pre codeblock"><code>CREATE TABLE columns_from_data_file LIKE PARQUET '/user/etl/destination/datafile1.dat' + PARTITION (year INT, month TINYINT, day TINYINT) + STORED AS PARQUET; +</code></pre> + + <p class="p"> + See <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> for more details about the <code class="ph codeph">CREATE TABLE + LIKE PARQUET</code> syntax. + </p> + + <p class="p"> + Once you have created a table, to insert data into that table, use a command similar to the following, + again with your own table names: + </p> + + + +<pre class="pre codeblock"><code>[impala-host:21000] > insert overwrite table <var class="keyword varname">parquet_table_name</var> select * from <var class="keyword varname">other_table_name</var>;</code></pre> + + <p class="p"> + If the Parquet table has a different number of columns or different column names than the other table, + specify the names of columns from the other table rather than <code class="ph codeph">*</code> in the + <code class="ph codeph">SELECT</code> statement. + </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title3" id="parquet__parquet_etl"> + + <h2 class="title topictitle2" id="ariaid-title3">Loading Data into Parquet Tables</h2> + + + <div class="body conbody"> + + <p class="p"> + Choose from the following techniques for loading data into Parquet tables, depending on whether the + original data is already in an Impala table, or exists as raw data files outside Impala. + </p> + + <p class="p"> + If you already have data in an Impala or Hive table, perhaps in a different file format or partitioning + scheme, you can transfer the data to a Parquet table using the Impala <code class="ph codeph">INSERT...SELECT</code> + syntax. You can convert, filter, repartition, and do other things to the data as part of this same + <code class="ph codeph">INSERT</code> statement. See <a class="xref" href="#parquet_compression">Snappy and GZip Compression for Parquet Data Files</a> for some examples showing how to + insert data into Parquet tables. 
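+ </p>
+
+ <p class="p">
+ For instance, a conversion might follow the pattern sketched below. The table and column names
+ (<code class="ph codeph">text_logs</code> and <code class="ph codeph">logs_parquet</code>) are hypothetical placeholders,
+ not tables used elsewhere in this topic:
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical example: TEXT_LOGS is an existing table in text format.
+-- Create a Parquet table with the desired layout.
+CREATE TABLE logs_parquet (url STRING, http_code SMALLINT, response_time INT)
+  STORED AS PARQUET;
+
+-- Convert and filter the data in a single INSERT ... SELECT statement.
+INSERT INTO logs_parquet
+  SELECT url, http_code, response_time
+  FROM text_logs
+  WHERE response_time IS NOT NULL;
+</code></pre>
+
+ <p class="p">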
+ </p> + + <div class="p"> + When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in + the <code class="ph codeph">INSERT</code> statement to fine-tune the overall performance of the operation and its + resource usage: + <ul class="ul"> + <li class="li"> + These hints are available in Impala 1.2.2 and higher. + </li> + + <li class="li"> + You would only use these hints if an <code class="ph codeph">INSERT</code> into a partitioned Parquet table was + failing due to capacity limits, or if such an <code class="ph codeph">INSERT</code> was succeeding but with + less-than-optimal performance. + </li> + + <li class="li"> + To use these hints, put the hint keyword <code class="ph codeph">[SHUFFLE]</code> or <code class="ph codeph">[NOSHUFFLE]</code> + (including the square brackets) after the <code class="ph codeph">PARTITION</code> clause, immediately before the + <code class="ph codeph">SELECT</code> keyword. + </li> + + <li class="li"> + <code class="ph codeph">[SHUFFLE]</code> selects an execution plan that minimizes the number of files being written + simultaneously to HDFS, and the number of memory buffers holding data for individual partitions. Thus + it reduces overall resource usage for the <code class="ph codeph">INSERT</code> operation, allowing some + <code class="ph codeph">INSERT</code> operations to succeed that otherwise would fail. It does involve some data + transfer between the nodes so that the data files for a particular partition are all constructed on the + same node. + </li> + + <li class="li"> + <code class="ph codeph">[NOSHUFFLE]</code> selects an execution plan that might be faster overall, but might also + produce a larger number of small data files or exceed capacity limits, causing the + <code class="ph codeph">INSERT</code> operation to fail. Use <code class="ph codeph">[SHUFFLE]</code> in cases where an + <code class="ph codeph">INSERT</code> statement fails or runs inefficiently due to all nodes attempting to construct + data for all partitions. + </li> + + <li class="li"> + Impala automatically uses the <code class="ph codeph">[SHUFFLE]</code> method if any partition key column in the + source table, mentioned in the <code class="ph codeph">INSERT ... SELECT</code> query, does not have column + statistics. In this case, only the <code class="ph codeph">[NOSHUFFLE]</code> hint would have any effect. + </li> + + <li class="li"> + If column statistics are available for all partition key columns in the source table mentioned in the + <code class="ph codeph">INSERT ... SELECT</code> query, Impala chooses whether to use the <code class="ph codeph">[SHUFFLE]</code> + or <code class="ph codeph">[NOSHUFFLE]</code> technique based on the estimated number of distinct values in those + columns and the number of nodes involved in the <code class="ph codeph">INSERT</code> operation. In this case, you + might need the <code class="ph codeph">[SHUFFLE]</code> or the <code class="ph codeph">[NOSHUFFLE]</code> hint to override the + execution plan selected by Impala. + </li> + </ul> + </div> + + <p class="p"> + Any <code class="ph codeph">INSERT</code> statement for a Parquet table requires enough free space in the HDFS filesystem + to write one block. Because Parquet data files use a block size of 1 GB by default, an + <code class="ph codeph">INSERT</code> might fail (even for a very small amount of data) if your HDFS is running low on + space. 
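+ </p>
+
+ <p class="p">
+ As an illustration of where the hint keyword goes, the following hypothetical statement
+ (the table names are placeholders) uses the <code class="ph codeph">[SHUFFLE]</code> hint described above,
+ placed after the <code class="ph codeph">PARTITION</code> clause and immediately before the
+ <code class="ph codeph">SELECT</code> keyword:
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical tables; SALES_PARQUET is partitioned by YEAR and MONTH.
+INSERT INTO sales_parquet PARTITION (year, month) [SHUFFLE]
+  SELECT amount, customer_id, year, month
+  FROM sales_staging;
+</code></pre>
+
+ <p class="p">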
+ </p> + + + + <p class="p"> + Avoid the <code class="ph codeph">INSERT...VALUES</code> syntax for Parquet tables, because + <code class="ph codeph">INSERT...VALUES</code> produces a separate tiny data file for each + <code class="ph codeph">INSERT...VALUES</code> statement, and the strength of Parquet is in its handling of data + (compressing, parallelizing, and so on) in <span class="ph">large</span> chunks. + </p> + + <p class="p"> + If you have one or more Parquet data files produced outside of Impala, you can quickly make the data + queryable through Impala by one of the following methods: + </p> + + <ul class="ul"> + <li class="li"> + The <code class="ph codeph">LOAD DATA</code> statement moves a single data file or a directory full of data files into + the data directory for an Impala table. It does no validation or conversion of the data. The original + data files must be somewhere in HDFS, not the local filesystem. + + </li> + + <li class="li"> + The <code class="ph codeph">CREATE TABLE</code> statement with the <code class="ph codeph">LOCATION</code> clause creates a table + where the data continues to reside outside the Impala data directory. The original data files must be + somewhere in HDFS, not the local filesystem. For extra safety, if the data is intended to be long-lived + and reused by other applications, you can use the <code class="ph codeph">CREATE EXTERNAL TABLE</code> syntax so that + the data files are not deleted by an Impala <code class="ph codeph">DROP TABLE</code> statement. + + </li> + + <li class="li"> + If the Parquet table already exists, you can copy Parquet data files directly into it, then use the + <code class="ph codeph">REFRESH</code> statement to make Impala recognize the newly added data. Remember to preserve + the block size of the Parquet data files by using the <code class="ph codeph">hadoop distcp -pb</code> command rather + than a <code class="ph codeph">-put</code> or <code class="ph codeph">-cp</code> operation on the Parquet files. See + <a class="xref" href="#parquet_compression_multiple">Example of Copying Parquet Data Files</a> for an example of this kind of operation. + </li> + </ul> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + Currently, Impala always decodes the column data in Parquet files based on the ordinal position of the + columns, not by looking up the position of each column based on its name. Parquet files produced outside + of Impala must write column data in the same order as the columns are declared in the Impala table. Any + optional columns that are omitted from the data files must be the rightmost columns in the Impala table + definition. + </p> + + <p class="p"> + If you created compressed Parquet files through some tool other than Impala, make sure that any + compression codecs are supported in Parquet by Impala. For example, Impala does not currently support LZO + compression in Parquet files. Also doublecheck that you used any recommended compatibility settings in + the other tool, such as <code class="ph codeph">spark.sql.parquet.binaryAsString</code> when writing Parquet files + through Spark. + </p> + </div> + + <p class="p"> + Recent versions of Sqoop can produce Parquet output files using the <code class="ph codeph">--as-parquetfile</code> + option. 
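+ </p>
+
+ <p class="p">
+ For example, once Sqoop (or any other tool) has written Parquet files into an HDFS staging
+ directory, a hypothetical sequence using the <code class="ph codeph">LOAD DATA</code> and
+ <code class="ph codeph">REFRESH</code> techniques listed above might look like the following. The path and
+ table name are placeholders:
+ </p>
+
+<pre class="pre codeblock"><code>-- LOAD DATA moves the files as-is, without validating or converting them.
+LOAD DATA INPATH '/user/etl/staging/orders_parquet'
+  INTO TABLE orders_parquet;
+
+-- If the files were instead copied straight into the table directory
+-- (for example with 'hadoop distcp -pb'), make Impala aware of them:
+REFRESH orders_parquet;
+</code></pre>
+
+ <p class="p">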
+ </p> + + <p class="p"> If you use Sqoop to + convert RDBMS data to Parquet, be careful with interpreting any + resulting values from <code class="ph codeph">DATE</code>, <code class="ph codeph">DATETIME</code>, + or <code class="ph codeph">TIMESTAMP</code> columns. The underlying values are + represented as the Parquet <code class="ph codeph">INT64</code> type, which is + represented as <code class="ph codeph">BIGINT</code> in the Impala table. The Parquet + values represent the time in milliseconds, while Impala interprets + <code class="ph codeph">BIGINT</code> as the time in seconds. Therefore, if you have + a <code class="ph codeph">BIGINT</code> column in a Parquet table that was imported + this way from Sqoop, divide the values by 1000 when interpreting as the + <code class="ph codeph">TIMESTAMP</code> type.</p> + + <p class="p"> + If the data exists outside Impala and is in some other format, combine both of the preceding techniques. + First, use a <code class="ph codeph">LOAD DATA</code> or <code class="ph codeph">CREATE EXTERNAL TABLE ... LOCATION</code> statement to + bring the data into an Impala table that uses the appropriate file format. Then, use an + <code class="ph codeph">INSERT...SELECT</code> statement to copy the data to the Parquet table, converting to Parquet + format as part of the process. + </p> + + + + <p class="p"> + Loading data into Parquet tables is a memory-intensive operation, because the incoming data is buffered + until it reaches <span class="ph">one data block</span> in size, then that chunk of data is + organized and compressed in memory before being written out. The memory consumption can be larger when + inserting data into partitioned Parquet tables, because a separate data file is written for each + combination of partition key column values, potentially requiring several + <span class="ph">large</span> chunks to be manipulated in memory at once. + </p> + + <p class="p"> + When inserting into a partitioned Parquet table, Impala redistributes the data among the nodes to reduce + memory consumption. You might still need to temporarily increase the memory dedicated to Impala during the + insert operation, or break up the load operation into several <code class="ph codeph">INSERT</code> statements, or both. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + All the preceding techniques assume that the data you are loading matches the structure of the destination + table, including column order, column names, and partition layout. To transform or reorganize the data, + start by loading the data into a Parquet table that matches the underlying structure of the data, then use + one of the table-copying techniques such as <code class="ph codeph">CREATE TABLE AS SELECT</code> or <code class="ph codeph">INSERT ... + SELECT</code> to reorder or rename columns, divide the data among multiple partitions, and so on. For + example to take a single comprehensive Parquet data file and load it into a partitioned table, you would + use an <code class="ph codeph">INSERT ... SELECT</code> statement with dynamic partitioning to let Impala create separate + data files with the appropriate partition values; for an example, see + <a class="xref" href="impala_insert.html#insert">INSERT Statement</a>. 
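+
+ <p class="p">
+ As a hypothetical sketch of that last technique (the table names are placeholders), the
+ following statement redistributes rows from an unpartitioned Parquet staging table into a
+ partitioned Parquet table, using dynamic partitioning:
+ </p>
+
+<pre class="pre codeblock"><code>-- WEB_STATS_STAGED is unpartitioned; WEB_STATS_PART is partitioned
+-- by YEAR, MONTH, and DAY. The partition key columns go last in the
+-- SELECT list, in the same order as in the PARTITION clause.
+INSERT INTO web_stats_part PARTITION (year, month, day)
+  SELECT url, http_code, response_time, year, month, day
+  FROM web_stats_staged;
+</code></pre>
+
+ <p class="p">
+ With this approach, Impala creates a separate set of data files for each distinct
+ (<code class="ph codeph">year</code>, <code class="ph codeph">month</code>, <code class="ph codeph">day</code>) combination.
+ </p>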
+ </div> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title4" id="parquet__parquet_performance"> + + <h2 class="title topictitle2" id="ariaid-title4">Query Performance for Impala Parquet Tables</h2> + + + <div class="body conbody"> + + <p class="p"> + Query performance for Parquet tables depends on the number of columns needed to process the + <code class="ph codeph">SELECT</code> list and <code class="ph codeph">WHERE</code> clauses of the query, the way data is divided into + <span class="ph">large data files with block size equal to file size</span>, the reduction in I/O + by reading the data for each column in compressed format, which data files can be skipped (for partitioned + tables), and the CPU overhead of decompressing the data for each column. + </p> + + <div class="p"> + For example, the following is an efficient query for a Parquet table: +<pre class="pre codeblock"><code>select avg(income) from census_data where state = 'CA';</code></pre> + The query processes only 2 columns out of a large number of total columns. If the table is partitioned by + the <code class="ph codeph">STATE</code> column, it is even more efficient because the query only has to read and decode + 1 column from each data file, and it can read only the data files in the partition directory for the state + <code class="ph codeph">'CA'</code>, skipping the data files for all the other states, which will be physically located + in other directories. + </div> + + <div class="p"> + The following is a relatively inefficient query for a Parquet table: +<pre class="pre codeblock"><code>select * from census_data;</code></pre> + Impala would have to read the entire contents of each <span class="ph">large</span> data file, + and decompress the contents of each column for each row group, negating the I/O optimizations of the + column-oriented format. This query might still be faster for a Parquet table than a table with some other + file format, but it does not take advantage of the unique strengths of Parquet data files. + </div> + + <p class="p"> + Impala can optimize queries on Parquet tables, especially join queries, better when statistics are + available for all the tables. Issue the <code class="ph codeph">COMPUTE STATS</code> statement for each table after + substantial amounts of data are loaded into or appended to it. See + <a class="xref" href="impala_compute_stats.html#compute_stats">COMPUTE STATS Statement</a> for details. + </p> + + <p class="p"> + The runtime filtering feature, available in <span class="keyword">Impala 2.5</span> and higher, works best with Parquet tables. + The per-row filtering aspect only applies to Parquet tables. + See <a class="xref" href="impala_runtime_filtering.html#runtime_filtering">Runtime Filtering for Impala Queries (Impala 2.5 or higher only)</a> for details. + </p> + + <p class="p"> + In <span class="keyword">Impala 2.6</span> and higher, Impala queries are optimized for files stored in Amazon S3. + For Impala tables that use the file formats Parquet, RCFile, SequenceFile, + Avro, and uncompressed text, the setting <code class="ph codeph">fs.s3a.block.size</code> + in the <span class="ph filepath">core-site.xml</span> configuration file determines + how Impala divides the I/O work of reading the data files. This configuration + setting is specified in bytes. By default, this + value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files + as if they were made up of 32 MB blocks. 
For example, if your S3 queries primarily access + Parquet files written by MapReduce or Hive, increase <code class="ph codeph">fs.s3a.block.size</code> + to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve + Parquet files written by Impala, increase <code class="ph codeph">fs.s3a.block.size</code> + to 268435456 (256 MB) to match the row group size produced by Impala. + </p> + + </div> + + <article class="topic concept nested2" aria-labelledby="ariaid-title5" id="parquet_performance__parquet_partitioning"> + + <h3 class="title topictitle3" id="ariaid-title5">Partitioning for Parquet Tables</h3> + + <div class="body conbody"> + + <p class="p"> + As explained in <a class="xref" href="impala_partitioning.html#partitioning">Partitioning for Impala Tables</a>, partitioning is an important + performance technique for Impala generally. This section explains some of the performance considerations + for partitioned Parquet tables. + </p> + + <p class="p"> + The Parquet file format is ideal for tables containing many columns, where most queries only refer to a + small subset of the columns. As explained in <a class="xref" href="#parquet_data_files">How Parquet Data Files Are Organized</a>, the physical layout of + Parquet data files lets Impala read only a small fraction of the data for many queries. The performance + benefits of this approach are amplified when you use Parquet tables in combination with partitioning. + Impala can skip the data files for certain partitions entirely, based on the comparisons in the + <code class="ph codeph">WHERE</code> clause that refer to the partition key columns. For example, queries on + partitioned tables often analyze data for time intervals based on columns such as <code class="ph codeph">YEAR</code>, + <code class="ph codeph">MONTH</code>, and/or <code class="ph codeph">DAY</code>, or for geographic regions. Remember that Parquet + data files use a <span class="ph">large</span> block size, so when deciding how finely to + partition the data, try to find a granularity where each partition contains + <span class="ph">256 MB</span> or more of data, rather than creating a large number of smaller + files split among many partitions. + </p> + + <p class="p"> + Inserting into a partitioned Parquet table can be a resource-intensive operation, because each Impala + node could potentially be writing a separate data file to HDFS for each combination of different values + for the partition key columns. The large number of simultaneous open files could exceed the HDFS + <span class="q">"transceivers"</span> limit. To avoid exceeding this limit, consider the following techniques: + </p> + + <ul class="ul"> + <li class="li"> + Load different subsets of data using separate <code class="ph codeph">INSERT</code> statements with specific values + for the <code class="ph codeph">PARTITION</code> clause, such as <code class="ph codeph">PARTITION (year=2010)</code>. + </li> + + <li class="li"> + Increase the <span class="q">"transceivers"</span> value for HDFS, sometimes spelled <span class="q">"xcievers"</span> (sic). The property + value in the <span class="ph filepath">hdfs-site.xml</span> configuration file is + + <code class="ph codeph">dfs.datanode.max.transfer.threads</code>. For example, if you were loading 12 years of data + partitioned by year, month, and day, even a value of 4096 might not be high enough. 
This + <a class="xref" href="http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/" target="_blank">blog post</a> explores the considerations for setting this value + higher or lower, using HBase examples for illustration. + </li> + + <li class="li"> + Use the <code class="ph codeph">COMPUTE STATS</code> statement to collect + <a class="xref" href="impala_perf_stats.html#perf_column_stats">column statistics</a> on the source table from + which data is being copied, so that the Impala query can estimate the number of different values in the + partition key columns and distribute the work accordingly. + </li> + </ul> + + </div> + + </article> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title6" id="parquet__parquet_compression"> + + <h2 class="title topictitle2" id="ariaid-title6">Snappy and GZip Compression for Parquet Data Files</h2> + + + <div class="body conbody"> + + <p class="p"> + + When Impala writes Parquet data files using the <code class="ph codeph">INSERT</code> statement, the underlying + compression is controlled by the <code class="ph codeph">COMPRESSION_CODEC</code> query option. (Prior to Impala 2.0, the + query option name was <code class="ph codeph">PARQUET_COMPRESSION_CODEC</code>.) The allowed values for this query option + are <code class="ph codeph">snappy</code> (the default), <code class="ph codeph">gzip</code>, and <code class="ph codeph">none</code>. The option + value is not case-sensitive. If the option is set to an unrecognized value, all kinds of queries will fail + due to the invalid option setting, not just queries involving Parquet tables. + </p> + + </div> + + <article class="topic concept nested2" aria-labelledby="ariaid-title7" id="parquet_compression__parquet_snappy"> + + <h3 class="title topictitle3" id="ariaid-title7">Example of Parquet Table with Snappy Compression</h3> + + <div class="body conbody"> + + <p class="p"> + + By default, the underlying data files for a Parquet table are compressed with Snappy. The combination of + fast compression and decompression makes it a good choice for many data sets. 
To ensure Snappy + compression is used, for example after experimenting with other compression codecs, set the + <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">snappy</code> before inserting the data: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create database parquet_compression; +[localhost:21000] > use parquet_compression; +[localhost:21000] > create table parquet_snappy like raw_text_data; +[localhost:21000] > set COMPRESSION_CODEC=snappy; +[localhost:21000] > insert into parquet_snappy select * from raw_text_data; +Inserted 1000000000 rows in 181.98s +</code></pre> + + </div> + + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title8" id="parquet_compression__parquet_gzip"> + + <h3 class="title topictitle3" id="ariaid-title8">Example of Parquet Table with GZip Compression</h3> + + <div class="body conbody"> + + <p class="p"> + If you need more intensive compression (at the expense of more CPU cycles for uncompressing during + queries), set the <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">gzip</code> before + inserting the data: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table parquet_gzip like raw_text_data; +[localhost:21000] > set COMPRESSION_CODEC=gzip; +[localhost:21000] > insert into parquet_gzip select * from raw_text_data; +Inserted 1000000000 rows in 1418.24s +</code></pre> + + </div> + + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title9" id="parquet_compression__parquet_none"> + + <h3 class="title topictitle3" id="ariaid-title9">Example of Uncompressed Parquet Table</h3> + + <div class="body conbody"> + + <p class="p"> + If your data compresses very poorly, or you want to avoid the CPU overhead of compression and + decompression entirely, set the <code class="ph codeph">COMPRESSION_CODEC</code> query option to <code class="ph codeph">none</code> + before inserting the data: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table parquet_none like raw_text_data; +[localhost:21000] > set COMPRESSION_CODEC=none; +[localhost:21000] > insert into parquet_none select * from raw_text_data; +Inserted 1000000000 rows in 146.90s +</code></pre> + + </div> + + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title10" id="parquet_compression__parquet_compression_examples"> + + <h3 class="title topictitle3" id="ariaid-title10">Examples of Sizes and Speeds for Compressed Parquet Tables</h3> + + <div class="body conbody"> + + <p class="p"> + Here are some examples showing differences in data sizes and query speeds for 1 billion rows of synthetic + data, compressed with each kind of codec. As always, run similar tests with realistic data sets of your + own. The actual compression ratios, and relative insert and query speeds, will vary depending on the + characteristics of the actual data. 
+ </p> + + <p class="p"> + In this case, switching from Snappy to GZip compression shrinks the data by an additional 40% or so, + while switching from Snappy compression to no compression expands the data also by about 40%: + </p> + +<pre class="pre codeblock"><code>$ hdfs dfs -du -h /user/hive/warehouse/parquet_compression.db +23.1 G /user/hive/warehouse/parquet_compression.db/parquet_snappy +13.5 G /user/hive/warehouse/parquet_compression.db/parquet_gzip +32.8 G /user/hive/warehouse/parquet_compression.db/parquet_none +</code></pre> + + <p class="p"> + Because Parquet data files are typically <span class="ph">large</span>, each directory will + have a different number of data files and the row groups will be arranged differently. + </p> + + <p class="p"> + At the same time, the less agressive the compression, the faster the data can be decompressed. In this + case using a table with a billion rows, a query that evaluates all the values for a particular column + runs faster with no compression than with Snappy compression, and faster with Snappy compression than + with Gzip compression. Query performance depends on several other factors, so as always, run your own + benchmarks with your own data to determine the ideal tradeoff between data size, CPU efficiency, and + speed of insert and query operations. + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > desc parquet_snappy; +Query finished, fetching results ... ++-----------+---------+---------+ +| name | type | comment | ++-----------+---------+---------+ +| id | int | | +| val | int | | +| zfill | string | | +| name | string | | +| assertion | boolean | | ++-----------+---------+---------+ +Returned 5 row(s) in 0.14s +[localhost:21000] > select avg(val) from parquet_snappy; +Query finished, fetching results ... ++-----------------+ +| _c0 | ++-----------------+ +| 250000.93577915 | ++-----------------+ +Returned 1 row(s) in 4.29s +[localhost:21000] > select avg(val) from parquet_gzip; +Query finished, fetching results ... ++-----------------+ +| _c0 | ++-----------------+ +| 250000.93577915 | ++-----------------+ +Returned 1 row(s) in 6.97s +[localhost:21000] > select avg(val) from parquet_none; +Query finished, fetching results ... ++-----------------+ +| _c0 | ++-----------------+ +| 250000.93577915 | ++-----------------+ +Returned 1 row(s) in 3.67s +</code></pre> + + </div> + + </article> + + <article class="topic concept nested2" aria-labelledby="ariaid-title11" id="parquet_compression__parquet_compression_multiple"> + + <h3 class="title topictitle3" id="ariaid-title11">Example of Copying Parquet Data Files</h3> + + <div class="body conbody"> + + <p class="p"> + Here is a final example, to illustrate how the data files using the various compression codecs are all + compatible with each other for read operations. The metadata about the compression format is written into + each data file, and can be decoded during queries regardless of the <code class="ph codeph">COMPRESSION_CODEC</code> + setting in effect at the time. In this example, we copy data files from the + <code class="ph codeph">PARQUET_SNAPPY</code>, <code class="ph codeph">PARQUET_GZIP</code>, and <code class="ph codeph">PARQUET_NONE</code> tables + used in the previous examples, each containing 1 billion rows, all to the data directory of a new table + <code class="ph codeph">PARQUET_EVERYTHING</code>. A couple of sample queries demonstrate that the new table now + contains 3 billion rows featuring a variety of compression codecs for the data files. 
+ </p> + + <p class="p"> + First, we create the table in Impala so that there is a destination directory in HDFS to put the data + files: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > create table parquet_everything like parquet_snappy; +Query: create table parquet_everything like parquet_snappy +</code></pre> + + <p class="p"> + Then in the shell, we copy the relevant data files into the data directory for this new table. Rather + than using <code class="ph codeph">hdfs dfs -cp</code> as with typical files, we use <code class="ph codeph">hadoop distcp -pb</code> + to ensure that the special <span class="ph"> block size</span> of the Parquet data files is + preserved. + </p> + +<pre class="pre codeblock"><code>$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_snappy \ + /user/hive/warehouse/parquet_compression.db/parquet_everything +...<var class="keyword varname">MapReduce output</var>... +$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_gzip \ + /user/hive/warehouse/parquet_compression.db/parquet_everything +...<var class="keyword varname">MapReduce output</var>... +$ hadoop distcp -pb /user/hive/warehouse/parquet_compression.db/parquet_none \ + /user/hive/warehouse/parquet_compression.db/parquet_everything +...<var class="keyword varname">MapReduce output</var>... +</code></pre> + + <p class="p"> + Back in the <span class="keyword cmdname">impala-shell</span> interpreter, we use the <code class="ph codeph">REFRESH</code> statement to + alert the Impala server to the new data files for this table, then we can run queries demonstrating that + the data files represent 3 billion rows, and the values for one of the numeric columns match what was in + the original smaller tables: + </p> + +<pre class="pre codeblock"><code>[localhost:21000] > refresh parquet_everything; +Query finished, fetching results ... + +Returned 0 row(s) in 0.32s +[localhost:21000] > select count(*) from parquet_everything; +Query finished, fetching results ... ++------------+ +| _c0 | ++------------+ +| 3000000000 | ++------------+ +Returned 1 row(s) in 8.18s +[localhost:21000] > select avg(val) from parquet_everything; +Query finished, fetching results ... ++-----------------+ +| _c0 | ++-----------------+ +| 250000.93577915 | ++-----------------+ +Returned 1 row(s) in 13.35s +</code></pre> + + </div> + + </article> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title12" id="parquet__parquet_complex_types"> + + <h2 class="title topictitle2" id="ariaid-title12">Parquet Tables for Impala Complex Types</h2> + + <div class="body conbody"> + + <p class="p"> + In <span class="keyword">Impala 2.3</span> and higher, Impala supports the complex types + <code class="ph codeph">ARRAY</code>, <code class="ph codeph">STRUCT</code>, and <code class="ph codeph">MAP</code> + See <a class="xref" href="impala_complex_types.html#complex_types">Complex Types (Impala 2.3 or higher only)</a> for details. + Because these data types are currently supported only for the Parquet file format, + if you plan to use them, become familiar with the performance and storage aspects + of Parquet first. 
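+ </p>
+
+ <p class="p">
+ For illustration only, a hypothetical Parquet table using these types, and a query that reads
+ elements of the <code class="ph codeph">ARRAY</code> column, might look like the following. See the linked
+ topic for the full query syntax:
+ </p>
+
+<pre class="pre codeblock"><code>-- Hypothetical schema combining ARRAY, STRUCT, and MAP columns.
+CREATE TABLE customers_nested
+(
+  id BIGINT,
+  name STRING,
+  phones ARRAY<STRING>,
+  address STRUCT<street: STRING, city: STRING, zip: STRING>,
+  preferences MAP<STRING,STRING>
+)
+STORED AS PARQUET;
+
+-- Elements of the ARRAY column are read through a join-style reference.
+SELECT c.name, p.item AS phone
+  FROM customers_nested c, c.phones p;
+</code></pre>
+
+ <p class="p">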
+ </p> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title13" id="parquet__parquet_interop"> + + <h2 class="title topictitle2" id="ariaid-title13">Exchanging Parquet Data Files with Other Hadoop Components</h2> + + + <div class="body conbody"> + + <p class="p"> + You can read and write Parquet data files from other <span class="keyword"></span> components. + See <span class="xref">the documentation for your Apache Hadoop distribution</span> for details. + </p> + + + + + + + + + + <p class="p"> + Previously, it was not possible to create Parquet data through Impala and reuse that table within Hive. Now + that Parquet support is available for Hive, reusing existing Impala Parquet data files in Hive + requires updating the table metadata. Use the following command if you are already running Impala 1.1.1 or + higher: + </p> + +<pre class="pre codeblock"><code>ALTER TABLE <var class="keyword varname">table_name</var> SET FILEFORMAT PARQUET; +</code></pre> + + <p class="p"> + If you are running a level of Impala that is older than 1.1.1, do the metadata update through Hive: + </p> + +<pre class="pre codeblock"><code>ALTER TABLE <var class="keyword varname">table_name</var> SET SERDE 'parquet.hive.serde.ParquetHiveSerDe'; +ALTER TABLE <var class="keyword varname">table_name</var> SET FILEFORMAT + INPUTFORMAT "parquet.hive.DeprecatedParquetInputFormat" + OUTPUTFORMAT "parquet.hive.DeprecatedParquetOutputFormat"; +</code></pre> + + <p class="p"> + Impala 1.1.1 and higher can reuse Parquet data files created by Hive, without any action required. + </p> + + + + <p class="p"> + Impala supports the scalar data types that you can encode in a Parquet data file, but not composite or + nested types such as maps or arrays. In <span class="keyword">Impala 2.2</span> and higher, Impala can query Parquet data + files that include composite or nested types, as long as the query only refers to columns with scalar + types. + + </p> + + <p class="p"> + If you copy Parquet data files between nodes, or even between different directories on the same node, make + sure to preserve the block size by using the command <code class="ph codeph">hadoop distcp -pb</code>. To verify that the + block size was preserved, issue the command <code class="ph codeph">hdfs fsck -blocks + <var class="keyword varname">HDFS_path_of_impala_table_dir</var></code> and check that the average block size is at or + near <span class="ph">256 MB (or whatever other size is defined by the + <code class="ph codeph">PARQUET_FILE_SIZE</code> query option).</span>. (The <code class="ph codeph">hadoop distcp</code> operation + typically leaves some directories behind, with names matching <span class="ph filepath">_distcp_logs_*</span>, that you + can delete from the destination directory afterward.) + + + + Issue the command <span class="keyword cmdname">hadoop distcp</span> for details about <span class="keyword cmdname">distcp</span> command + syntax. + </p> + + + + <p class="p"> + Impala can query Parquet files that use the <code class="ph codeph">PLAIN</code>, <code class="ph codeph">PLAIN_DICTIONARY</code>, + <code class="ph codeph">BIT_PACKED</code>, and <code class="ph codeph">RLE</code> encodings. + Currently, Impala does not support <code class="ph codeph">RLE_DICTIONARY</code> encoding. + When creating files outside of Impala for use by Impala, make sure to use one of the supported encodings. 
+ In particular, for MapReduce jobs, <code class="ph codeph">parquet.writer.version</code> must not be defined + (especially as <code class="ph codeph">PARQUET_2_0</code>) for writing the configurations of Parquet MR jobs. + Use the default version (or format). The default format, 1.0, includes some enhancements that are compatible with older versions. + Data using the 2.0 format might not be consumable by Impala, due to use of the <code class="ph codeph">RLE_DICTIONARY</code> encoding. + </p> + <div class="p"> + To examine the internal structure and data of Parquet files, you can use the + <span class="keyword cmdname">parquet-tools</span> command. Make sure this + command is in your <code class="ph codeph">$PATH</code>. (Typically, it is symlinked from + <span class="ph filepath">/usr/bin</span>; sometimes, depending on your installation setup, you + might need to locate it under an alternative <code class="ph codeph">bin</code> directory.) + The arguments to this command let you perform operations such as: + <ul class="ul"> + <li class="li"> + <code class="ph codeph">cat</code>: Print a file's contents to standard out. In <span class="keyword">Impala 2.3</span> and higher, you can use + the <code class="ph codeph">-j</code> option to output JSON. + </li> + <li class="li"> + <code class="ph codeph">head</code>: Print the first few records of a file to standard output. + </li> + <li class="li"> + <code class="ph codeph">schema</code>: Print the Parquet schema for the file. + </li> + <li class="li"> + <code class="ph codeph">meta</code>: Print the file footer metadata, including key-value properties (like Avro schema), compression ratios, + encodings, compression used, and row group information. + </li> + <li class="li"> + <code class="ph codeph">dump</code>: Print all data and metadata. + </li> + </ul> + Use <code class="ph codeph">parquet-tools -h</code> to see usage information for all the arguments. + Here are some examples showing <span class="keyword cmdname">parquet-tools</span> usage: + +<pre class="pre codeblock"><code> +$ # Be careful doing this for a big file! Use parquet-tools head to be safe. +$ parquet-tools cat sample.parq +year = 1992 +month = 1 +day = 2 +dayofweek = 4 +dep_time = 748 +crs_dep_time = 750 +arr_time = 851 +crs_arr_time = 846 +carrier = US +flight_num = 53 +actual_elapsed_time = 63 +crs_elapsed_time = 56 +arrdelay = 5 +depdelay = -2 +origin = CMH +dest = IND +distance = 182 +cancelled = 0 +diverted = 0 + +year = 1992 +month = 1 +day = 3 +... + +</code></pre> + +<pre class="pre codeblock"><code> +$ parquet-tools head -n 2 sample.parq +year = 1992 +month = 1 +day = 2 +dayofweek = 4 +dep_time = 748 +crs_dep_time = 750 +arr_time = 851 +crs_arr_time = 846 +carrier = US +flight_num = 53 +actual_elapsed_time = 63 +crs_elapsed_time = 56 +arrdelay = 5 +depdelay = -2 +origin = CMH +dest = IND +distance = 182 +cancelled = 0 +diverted = 0 + +year = 1992 +month = 1 +day = 3 +... + +</code></pre> + +<pre class="pre codeblock"><code> +$ parquet-tools schema sample.parq +message schema { + optional int32 year; + optional int32 month; + optional int32 day; + optional int32 dayofweek; + optional int32 dep_time; + optional int32 crs_dep_time; + optional int32 arr_time; + optional int32 crs_arr_time; + optional binary carrier; + optional int32 flight_num; +... + +</code></pre> + +<pre class="pre codeblock"><code> +$ parquet-tools meta sample.parq +creator: impala version 2.2.0-... 
+ +file schema: schema +------------------------------------------------------------------- +year: OPTIONAL INT32 R:0 D:1 +month: OPTIONAL INT32 R:0 D:1 +day: OPTIONAL INT32 R:0 D:1 +dayofweek: OPTIONAL INT32 R:0 D:1 +dep_time: OPTIONAL INT32 R:0 D:1 +crs_dep_time: OPTIONAL INT32 R:0 D:1 +arr_time: OPTIONAL INT32 R:0 D:1 +crs_arr_time: OPTIONAL INT32 R:0 D:1 +carrier: OPTIONAL BINARY R:0 D:1 +flight_num: OPTIONAL INT32 R:0 D:1 +... + +row group 1: RC:20636601 TS:265103674 +------------------------------------------------------------------- +year: INT32 SNAPPY DO:4 FPO:35 SZ:10103/49723/4.92 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +month: INT32 SNAPPY DO:10147 FPO:10210 SZ:11380/35732/3.14 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +day: INT32 SNAPPY DO:21572 FPO:21714 SZ:3071658/9868452/3.21 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +dayofweek: INT32 SNAPPY DO:3093276 FPO:3093319 SZ:2274375/5941876/2.61 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +dep_time: INT32 SNAPPY DO:5367705 FPO:5373967 SZ:28281281/28573175/1.01 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +crs_dep_time: INT32 SNAPPY DO:33649039 FPO:33654262 SZ:10220839/11574964/1.13 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +arr_time: INT32 SNAPPY DO:43869935 FPO:43876489 SZ:28562410/28797767/1.01 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +crs_arr_time: INT32 SNAPPY DO:72432398 FPO:72438151 SZ:10908972/12164626/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +carrier: BINARY SNAPPY DO:83341427 FPO:83341558 SZ:114916/128611/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +flight_num: INT32 SNAPPY DO:83456393 FPO:83488603 SZ:10216514/11474301/1.12 VC:20636601 ENC:PLAIN_DICTIONARY,RLE,PLAIN +... + +</code></pre> + </div> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title14" id="parquet__parquet_data_files"> + + <h2 class="title topictitle2" id="ariaid-title14">How Parquet Data Files Are Organized</h2> + + + <div class="body conbody"> + + <p class="p"> + Although Parquet is a column-oriented file format, do not expect to find one data file for each column. + Parquet keeps all the data for a row within the same data file, to ensure that the columns for a row are + always available on the same node for processing. What Parquet does is to set a large HDFS block size and a + matching maximum data file size, to ensure that I/O and network transfer requests apply to large batches of + data. + </p> + + <p class="p"> + Within that data file, the data for a set of rows is rearranged so that all the values from the first + column are organized in one contiguous block, then all the values from the second column, and so on. + Putting the values from the same column next to each other lets Impala use effective compression techniques + on the values in that column. + </p> + + <div class="note note note_note"><span class="note__title notetitle">Note:</span> + <p class="p"> + Impala <code class="ph codeph">INSERT</code> statements write Parquet data files using an HDFS block size + <span class="ph">that matches the data file size</span>, to ensure that each data file is + represented by a single HDFS block, and the entire file can be processed on a single node without + requiring any remote reads. + </p> + + <p class="p"> + If you create Parquet data files outside of Impala, such as through a MapReduce or Pig job, ensure that + the HDFS block size is greater than or equal to the file size, so that the <span class="q">"one file per block"</span> + relationship is maintained. 
Set the <code class="ph codeph">dfs.block.size</code> or the <code class="ph codeph">dfs.blocksize</code> + property large enough that each file fits within a single HDFS block, even if that size is larger than + the normal HDFS block size. + </p> + + <p class="p"> + If the block size is reset to a lower value during a file copy, you will see lower performance for + queries involving those files, and the <code class="ph codeph">PROFILE</code> statement will reveal that some I/O is + being done suboptimally, through remote reads. See + <a class="xref" href="impala_parquet.html#parquet_compression_multiple">Example of Copying Parquet Data Files</a> for an example showing how to preserve the + block size when copying Parquet data files. + </p> + </div> + + <p class="p"> + When Impala retrieves or tests the data for a particular column, it opens all the data files, but only + reads the portion of each file containing the values for that column. The column values are stored + consecutively, minimizing the I/O required to process the values within a single column. If other columns + are named in the <code class="ph codeph">SELECT</code> list or <code class="ph codeph">WHERE</code> clauses, the data for all columns + in the same row is available within that same data file. + </p> + + <p class="p"> + If an <code class="ph codeph">INSERT</code> statement brings in less than <span class="ph">one Parquet + block's worth</span> of data, the resulting data file is smaller than ideal. Thus, if you do split up an ETL + job to use multiple <code class="ph codeph">INSERT</code> statements, try to keep the volume of data for each + <code class="ph codeph">INSERT</code> statement to approximately <span class="ph">256 MB, or a multiple of + 256 MB</span>. + </p> + + </div> + + <article class="topic concept nested2" aria-labelledby="ariaid-title15" id="parquet_data_files__parquet_encoding"> + + <h3 class="title topictitle3" id="ariaid-title15">RLE and Dictionary Encoding for Parquet Data Files</h3> + + <div class="body conbody"> + + <p class="p"> + Parquet uses some automatic compression techniques, such as run-length encoding (RLE) and dictionary + encoding, based on analysis of the actual data values. Once the data values are encoded in a compact + form, the encoded data can optionally be further compressed using a compression algorithm. Parquet data + files created by Impala can use Snappy, GZip, or no compression; the Parquet spec also allows LZO + compression, but currently Impala does not support LZO-compressed Parquet files. + </p> + + <p class="p"> + RLE and dictionary encoding are compression techniques that Impala applies automatically to groups of + Parquet data values, in addition to any Snappy or GZip compression applied to the entire data files. + These automatic optimizations can save you time and planning that are normally needed for a traditional + data warehouse. For example, dictionary encoding reduces the need to create numeric IDs as abbreviations + for longer string values. + </p> + + <p class="p"> + Run-length encoding condenses sequences of repeated data values. For example, if many consecutive rows + all contain the same value for a country code, those repeating values can be represented by the value + followed by a count of how many times it appears consecutively. + </p> + + <p class="p"> + Dictionary encoding takes the different values present in a column, and represents each one in compact + 2-byte form rather than the original value, which could be several bytes. 
(Additional compression is
+ applied to the compacted values, for extra space savings.) This type of encoding applies when the number
+ of different values for a column is less than 2**16 (65,536). It does not apply to columns of data type
+ <code class="ph codeph">BOOLEAN</code>, which are already very short. <code class="ph codeph">TIMESTAMP</code> columns sometimes have
+ a unique value for each row, in which case they can quickly exceed the 2**16 limit on distinct values.
+ The 2**16 limit on different values within a column is reset for each data file, so if several different
+ data files each contained 10,000 different city names, the city name column in each data file could still
+ be condensed using dictionary encoding.
+ </p>
+
+ </div>
+
+ </article>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title16" id="parquet__parquet_compacting">
+
+ <h2 class="title topictitle2" id="ariaid-title16">Compacting Data Files for Parquet Tables</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ If you reuse existing table structures or ETL processes for Parquet tables, you might encounter a <span class="q">"many
+ small files"</span> situation, which is suboptimal for query efficiency. For example, statements like these
+ might produce inefficiently organized data files:
+ </p>
+
+<pre class="pre codeblock"><code>-- In an N-node cluster, each node produces a data file
+-- for the INSERT operation. If you have less than
+-- N GB of data to copy, some files are likely to be
+-- much smaller than the <span class="ph">default Parquet</span> block size.
+insert into parquet_table select * from text_table;
+
+-- Even if this operation involves an overall large amount of data,
+-- when split up by year/month/day, each partition might only
+-- receive a small amount of data. Then the data files for
+-- the partition might be divided between the N nodes in the cluster.
+-- A multi-gigabyte copy operation might produce files of only
+-- a few MB each.
+insert into partitioned_parquet_table partition (year, month, day)
+  select year, month, day, url, referer, user_agent, http_code, response_time
+  from web_stats;
+</code></pre>
+
+ <p class="p">
+ Here are techniques to help you produce large data files in Parquet <code class="ph codeph">INSERT</code> operations, and
+ to compact existing too-small data files:
+ </p>
+
+ <ul class="ul">
+ <li class="li">
+ <p class="p">
+ When inserting into a partitioned Parquet table, use statically partitioned <code class="ph codeph">INSERT</code>
+ statements where the partition key values are specified as constant values. Ideally, use a separate
+ <code class="ph codeph">INSERT</code> statement for each partition.
+ </p>
+ </li>
+
+ <li class="li">
+ <p class="p">
+ You might set the <code class="ph codeph">NUM_NODES</code> option to 1 briefly, during <code class="ph codeph">INSERT</code> or
+ <code class="ph codeph">CREATE TABLE AS SELECT</code> statements. Normally, those statements produce one or more data
+ files per data node. If the write operation involves small amounts of data, a Parquet table, and/or a
+ partitioned table, the default behavior could produce many small files when intuitively you might expect
+ only a single output file. <code class="ph codeph">SET NUM_NODES=1</code> turns off the <span class="q">"distributed"</span> aspect of the
+ write operation, making it more likely to produce only one or a few data files.
+ </p> + </li> + + <li class="li"> + <p class="p"> + Be prepared to reduce the number of partition key columns from what you are used to with traditional + analytic database systems. + </p> + </li> + + <li class="li"> + <p class="p"> + Do not expect Impala-written Parquet files to fill up the entire Parquet block size. Impala estimates + on the conservative side when figuring out how much data to write to each Parquet file. Typically, the + of uncompressed data in memory is substantially reduced on disk by the compression and encoding + techniques in the Parquet file format. + + The final data file size varies depending on the compressibility of the data. Therefore, it is not an + indication of a problem if <span class="ph">256 MB</span> of text data is turned into 2 + Parquet data files, each less than <span class="ph">256 MB</span>. + </p> + </li> + + <li class="li"> + <p class="p"> + If you accidentally end up with a table with many small data files, consider using one or more of the + preceding techniques and copying all the data into a new Parquet table, either through <code class="ph codeph">CREATE + TABLE AS SELECT</code> or <code class="ph codeph">INSERT ... SELECT</code> statements. + </p> + + <p class="p"> + To avoid rewriting queries to change table names, you can adopt a convention of always running + important queries against a view. Changing the view definition immediately switches any subsequent + queries to use the new underlying tables: + </p> +<pre class="pre codeblock"><code>create view production_table as select * from table_with_many_small_files; +-- CTAS or INSERT...SELECT all the data into a more efficient layout... +alter view production_table as select * from table_with_few_big_files; +select * from production_table where c1 = 100 and c2 < 50 and ...; +</code></pre> + </li> + </ul> + + </div> + + </article> + + <article class="topic concept nested1" aria-labelledby="ariaid-title17" id="parquet__parquet_schema_evolution"> + + <h2 class="title topictitle2" id="ariaid-title17">Schema Evolution for Parquet Tables</h2> + + <div class="body conbody"> + + <p class="p"> + Schema evolution refers to using the statement <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> to change + the names, data type, or number of columns in a table. You can perform schema evolution for Parquet tables + as follows: + </p> + + <ul class="ul"> + <li class="li"> + <p class="p"> + The Impala <code class="ph codeph">ALTER TABLE</code> statement never changes any data files in the tables. From the + Impala side, schema evolution involves interpreting the same data files in terms of a new table + definition. Some types of schema changes make sense and are represented correctly. Other types of + changes cannot be represented in a sensible way, and produce special result values or conversion errors + during queries. + </p> + </li> + + <li class="li"> + <p class="p"> + The <code class="ph codeph">INSERT</code> statement always creates data using the latest table definition. You might + end up with data files with different numbers of columns or internal data representations if you do a + sequence of <code class="ph codeph">INSERT</code> and <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> statements. + </p> + </li> + + <li class="li"> + <p class="p"> + If you use <code class="ph codeph">ALTER TABLE ... 
REPLACE COLUMNS</code> to define additional columns at the end, + when the original data files are used in a query, these final columns are considered to be all + <code class="ph codeph">NULL</code> values. + </p> + </li> + + <li class="li"> + <p class="p"> + If you use <code class="ph codeph">ALTER TABLE ... REPLACE COLUMNS</code> to define fewer columns than before, when + the original data files are used in a query, the unused columns still present in the data file are + ignored. + </p> + </li> + + <li class="li"> + <p class="p"> + Parquet represents the <code class="ph codeph">TINYINT</code>, <code class="ph codeph">SMALLINT</code>, and <code class="ph codeph">INT</code> + types the same internally, all stored in 32-bit integers. + </p> + <ul class="ul"> + <li class="li"> + That means it is easy to promote a <code class="ph codeph">TINYINT</code> column to <code class="ph codeph">SMALLINT</code> or + <code class="ph codeph">INT</code>, or a <code class="ph codeph">SMALLINT</code> column to <code class="ph codeph">INT</code>. The numbers are + represented exactly the same in the data file, and the columns being promoted would not contain any + out-of-range values. + </li> + + <li class="li"> + <p class="p"> + If you change any of these column types to a smaller type, any values that are out-of-range for the + new type are returned incorrectly, typically as negative numbers. + </p> + </li> + + <li class="li"> + <p class="p"> + You cannot change a <code class="ph codeph">TINYINT</code>, <code class="ph codeph">SMALLINT</code>, or <code class="ph codeph">INT</code> + column to <code class="ph codeph">BIGINT</code>, or the other way around. Although the <code class="ph codeph">ALTER + TABLE</code> succeeds, any attempt to query those columns results in conversion errors. + </p> + </li> + + <li class="li"> + <p class="p"> + Any other type conversion for columns produces a conversion error during queries. For example, + <code class="ph codeph">INT</code> to <code class="ph codeph">STRING</code>, <code class="ph codeph">FLOAT</code> to <code class="ph codeph">DOUBLE</code>, + <code class="ph codeph">TIMESTAMP</code> to <code class="ph codeph">STRING</code>, <code class="ph codeph">DECIMAL(9,0)</code> to + <code class="ph codeph">DECIMAL(5,2)</code>, and so on. + </p> + </li> + </ul> + </li> + </ul> + + <div class="p"> + You might find that you have Parquet files where the columns do not line up in the same + order as in your Impala table. For example, you might have a Parquet file that was part of + a table with columns <code class="ph codeph">C1,C2,C3,C4</code>, and now you want to reuse the same + Parquet file in a table with columns <code class="ph codeph">C4,C2</code>. By default, Impala expects the + columns in the data file to appear in the same order as the columns defined for the table, + making it impractical to do some kinds of file reuse or schema evolution. In <span class="keyword">Impala 2.6</span> + and higher, the query option <code class="ph codeph">PARQUET_FALLBACK_SCHEMA_RESOLUTION=name</code> lets Impala + resolve columns by name, and therefore handle out-of-order or extra columns in the data file. 
+ For example:
+
+<pre class="pre codeblock"><code>
+create database schema_evolution;
+use schema_evolution;
+create table t1 (c1 int, c2 boolean, c3 string, c4 timestamp)
+  stored as parquet;
+insert into t1 values
+  (1, true, 'yes', now()),
+  (2, false, 'no', now() + interval 1 day);
+
+select * from t1;
++----+-------+-----+-------------------------------+
+| c1 | c2    | c3  | c4                            |
++----+-------+-----+-------------------------------+
+| 1  | true  | yes | 2016-06-28 14:53:26.554369000 |
+| 2  | false | no  | 2016-06-29 14:53:26.554369000 |
++----+-------+-----+-------------------------------+
+
+desc formatted t1;
+...
+| Location:   | /user/hive/warehouse/schema_evolution.db/t1   |
+...
+
+-- T2 declares only two of T1's columns, in a different order.
+create table t2 (c4 timestamp, c2 boolean) stored as parquet;
+
+-- Make T2 have the same data file as in T1, including 2
+-- unused columns and column order different than T2 expects.
+load data inpath '/user/hive/warehouse/schema_evolution.db/t1'
+  into table t2;
++------------------------------------------------------------+
+| summary                                                    |
++------------------------------------------------------------+
+| Loaded 1 file(s). Total files in destination location: 1  |
++------------------------------------------------------------+
+
+-- 'position' is the default setting.
+-- Impala cannot read the Parquet file if the column order does not match.
+set PARQUET_FALLBACK_SCHEMA_RESOLUTION=position;
+PARQUET_FALLBACK_SCHEMA_RESOLUTION set to position
+
+select * from t2;
+WARNINGS:
+File 'schema_evolution.db/t2/45331705_data.0.parq'
+has an incompatible Parquet schema for column 'schema_evolution.t2.c4'.
+Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0]
+
+File 'schema_evolution.db/t2/45331705_data.0.parq'
+has an incompatible Parquet schema for column 'schema_evolution.t2.c4'.
+Column type: TIMESTAMP, Parquet schema: optional int32 c1 [i:0 d:1 r:0]
+
+-- With the 'name' setting, Impala can read the Parquet data files
+-- despite mismatching column order.
+set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
+PARQUET_FALLBACK_SCHEMA_RESOLUTION set to name
+
+select * from t2;
++-------------------------------+-------+
+| c4                            | c2    |
++-------------------------------+-------+
+| 2016-06-28 14:53:26.554369000 | true  |
+| 2016-06-29 14:53:26.554369000 | false |
++-------------------------------+-------+
+
+</code></pre>
+
+ See <a class="xref" href="impala_parquet_fallback_schema_resolution.html#parquet_fallback_schema_resolution">PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (Impala 2.6 or higher only)</a>
+ for more details.
+ </div>
+
+ </div>
+
+ </article>
+
+ <article class="topic concept nested1" aria-labelledby="ariaid-title18" id="parquet__parquet_data_types">
+
+ <h2 class="title topictitle2" id="ariaid-title18">Data Type Considerations for Parquet Tables</h2>
+
+ <div class="body conbody">
+
+ <p class="p">
+ The Parquet format defines a set of data types whose names differ from the names of the corresponding
+ Impala data types. If you are preparing Parquet files using other Hadoop components such as Pig or
+ MapReduce, you might need to work with the type names defined by Parquet. The following listings show the
+ Parquet-defined types and the equivalent types in Impala.
+ </p> + + <p class="p"> + <strong class="ph b">Primitive types:</strong> + </p> + +<pre class="pre codeblock"><code>BINARY -> STRING +BOOLEAN -> BOOLEAN +DOUBLE -> DOUBLE +FLOAT -> FLOAT +INT32 -> INT +INT64 -> BIGINT +INT96 -> TIMESTAMP +</code></pre> + + <p class="p"> + <strong class="ph b">Logical types:</strong> + </p> + +<pre class="pre codeblock"><code>BINARY + OriginalType UTF8 -> STRING +BINARY + OriginalType DECIMAL -> DECIMAL +</code></pre> + + <p class="p"> + <strong class="ph b">Complex types:</strong> + </p> + + <p class="p"> + For the complex types (<code class="ph codeph">ARRAY</code>, <code class="ph codeph">MAP</code>, and <code class="ph codeph">STRUCT</code>) + available in <span class="keyword">Impala 2.3</span> and higher, Impala only supports queries + against those types in Parquet tables. + </p> + + </div> + + </article> + +</article></main></body></html> \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_parquet_annotate_strings_utf8.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_parquet_annotate_strings_utf8.html b/docs/build/html/topics/impala_parquet_annotate_strings_utf8.html new file mode 100644 index 0000000..6f6ed71 --- /dev/null +++ b/docs/build/html/topics/impala_parquet_annotate_strings_utf8.html @@ -0,0 +1,54 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_annotate_strings_utf8"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (Impala 2.6 or higher only)</title></head><body id="parquet_annotate_strings_utf8"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">PARQUET_ANNOTATE_STRINGS_UTF8 Query Option (<span class="keyword">Impala 2.6</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Causes Impala <code class="ph codeph">INSERT</code> and <code class="ph codeph">CREATE TABLE AS SELECT</code> statements + to write Parquet files that use the UTF-8 annotation for <code class="ph codeph">STRING</code> columns. + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + <p class="p"> + By default, Impala represents a <code class="ph codeph">STRING</code> column in Parquet as an unannotated binary field. + </p> + <p class="p"> + Impala always uses the UTF-8 annotation when writing <code class="ph codeph">CHAR</code> and <code class="ph codeph">VARCHAR</code> + columns to Parquet files. An alternative to using the query option is to cast <code class="ph codeph">STRING</code> + values to <code class="ph codeph">VARCHAR</code>. + </p> + <p class="p"> + This option is to help make Impala-written data more interoperable with other data processing engines. + Impala itself currently does not support all operations on UTF-8 data. + Although data processed by Impala is typically represented in ASCII, it is valid to designate the + data as UTF-8 when storing on disk, because ASCII is a subset of UTF-8. 
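+ </p>
+ <p class="p">
+ For example, the following is a minimal sketch (the table names are illustrative) that enables the
+ option before copying data from a text table, so that other engines reading the resulting Parquet files
+ see the <code class="ph codeph">STRING</code> columns annotated as UTF-8:
+ </p>
+<pre class="pre codeblock"><code>set parquet_annotate_strings_utf8=true;
+create table utf8_annotated_parquet stored as parquet
+  as select * from text_table;
+</code></pre>
+ <p class="p">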
+ </p> + <p class="p"> + <strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and <code class="ph codeph">false</code>; + any other value interpreted as <code class="ph codeph">false</code> + </p> + <p class="p"> + <strong class="ph b">Default:</strong> <code class="ph codeph">false</code> (shown as 0 in output of <code class="ph codeph">SET</code> statement) + </p> + + <p class="p"> + <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.6.0</span> + </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + <p class="p"> + <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a> + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_parquet_compression_codec.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_parquet_compression_codec.html b/docs/build/html/topics/impala_parquet_compression_codec.html new file mode 100644 index 0000000..34ae693 --- /dev/null +++ b/docs/build/html/topics/impala_parquet_compression_codec.html @@ -0,0 +1,17 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_compression_codec"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_COMPRESSION_CODEC Query Option</title></head><body id="parquet_compression_codec"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">PARQUET_COMPRESSION_CODEC Query Option</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Deprecated. Use <code class="ph codeph">COMPRESSION_CODEC</code> in Impala 2.0 and later. See + <a class="xref" href="impala_compression_codec.html#compression_codec">COMPRESSION_CODEC Query Option (Impala 2.0 or higher only)</a> for details. 
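+ </p>
+ <p class="p">
+ For example, a minimal sketch of the replacement option (the table names are illustrative):
+ </p>
+<pre class="pre codeblock"><code>-- Use COMPRESSION_CODEC instead of the deprecated PARQUET_COMPRESSION_CODEC.
+set compression_codec=gzip;
+insert into parquet_table select * from text_table;
+set compression_codec=snappy;  -- Snappy is the default codec.
+</code></pre>
+ <p class="p">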
+ </p> + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_parquet_fallback_schema_resolution.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_parquet_fallback_schema_resolution.html b/docs/build/html/topics/impala_parquet_fallback_schema_resolution.html new file mode 100644 index 0000000..91abf35 --- /dev/null +++ b/docs/build/html/topics/impala_parquet_fallback_schema_resolution.html @@ -0,0 +1,46 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_fallback_schema_resolution"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (Impala 2.6 or higher only)</title></head><body id="parquet_fallback_schema_resolution"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">PARQUET_FALLBACK_SCHEMA_RESOLUTION Query Option (<span class="keyword">Impala 2.6</span> or higher only)</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Allows Impala to look up columns within Parquet files by column name, rather than column order, + when necessary. + </p> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + <p class="p"> + By default, Impala looks up columns within a Parquet file based on + the order of columns in the table. + The <code class="ph codeph">name</code> setting for this option enables behavior for Impala queries + similar to the Hive setting <code class="ph codeph">parquet.column.index.access=false</code>. + It also allows Impala to query Parquet files created by Hive with the + <code class="ph codeph">parquet.column.index.access=false</code> setting in effect. + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> integer or string. + Allowed values are 0 or <code class="ph codeph">position</code> (default), 1 or <code class="ph codeph">name</code>. 
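+ </p>
+ <p class="p">
+ For example, the following minimal sketch switches the current session to name-based resolution before
+ querying a table whose data files have reordered or extra columns (reusing the <code class="ph codeph">T2</code>
+ table from the schema evolution example):
+ </p>
+<pre class="pre codeblock"><code>set parquet_fallback_schema_resolution=name;
+-- Columns are now matched to the Parquet data file by name rather than position.
+select * from t2;
+</code></pre>
+ <p class="p">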
+ </p> + + <p class="p"> + <strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.6.0</span> + </p> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + <p class="p"> + <a class="xref" href="impala_parquet.html#parquet_schema_evolution">Schema Evolution for Parquet Tables</a> + </p> + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/75c46918/docs/build/html/topics/impala_parquet_file_size.html ---------------------------------------------------------------------- diff --git a/docs/build/html/topics/impala_parquet_file_size.html b/docs/build/html/topics/impala_parquet_file_size.html new file mode 100644 index 0000000..695c557 --- /dev/null +++ b/docs/build/html/topics/impala_parquet_file_size.html @@ -0,0 +1,93 @@ +<!DOCTYPE html + SYSTEM "about:legacy-compat"> +<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta charset="UTF-8"><meta name="copyright" content="(C) Copyright 2017"><meta name="DC.rights.owner" content="(C) Copyright 2017"><meta name="DC.Type" content="concept"><meta name="DC.Relation" scheme="URI" content="../topics/impala_query_options.html"><meta name="prodname" content="Impala"><meta name="prodname" content="Impala"><meta name="version" content="Impala 2.8.x"><meta name="version" content="Impala 2.8.x"><meta name="DC.Format" content="XHTML"><meta name="DC.Identifier" content="parquet_file_size"><link rel="stylesheet" type="text/css" href="../commonltr.css"><title>PARQUET_FILE_SIZE Query Option</title></head><body id="parquet_file_size"><main role="main"><article role="article" aria-labelledby="ariaid-title1"> + + <h1 class="title topictitle1" id="ariaid-title1">PARQUET_FILE_SIZE Query Option</h1> + + + + <div class="body conbody"> + + <p class="p"> + + Specifies the maximum size of each Parquet data file produced by Impala <code class="ph codeph">INSERT</code> statements. + </p> + + <p class="p"> + <strong class="ph b">Syntax:</strong> + </p> + + <p class="p"> + Specify the size in bytes, or with a trailing <code class="ph codeph">m</code> or <code class="ph codeph">g</code> character to indicate + megabytes or gigabytes. For example: + </p> + +<pre class="pre codeblock"><code>-- 128 megabytes. +set PARQUET_FILE_SIZE=134217728 +INSERT OVERWRITE parquet_table SELECT * FROM text_table; + +-- 512 megabytes. +set PARQUET_FILE_SIZE=512m; +INSERT OVERWRITE parquet_table SELECT * FROM text_table; + +-- 1 gigabyte. +set PARQUET_FILE_SIZE=1g; +INSERT OVERWRITE parquet_table SELECT * FROM text_table; +</code></pre> + + <p class="p"> + <strong class="ph b">Usage notes:</strong> + </p> + + <p class="p"> + With tables that are small or finely partitioned, the default Parquet block size (formerly 1 GB, now 256 MB + in Impala 2.0 and later) could be much larger than needed for each data file. For <code class="ph codeph">INSERT</code> + operations into such tables, you can increase parallelism by specifying a smaller + <code class="ph codeph">PARQUET_FILE_SIZE</code> value, resulting in more HDFS blocks that can be processed by different + nodes. 
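+ </p>
+ <p class="p">
+ For example, the following sketch (the size and table names are illustrative) writes smaller files for a
+ finely partitioned table so that more hosts can process the resulting data in parallel:
+ </p>
+<pre class="pre codeblock"><code>-- Aim for 64 MB data files rather than the 256 MB default.
+set PARQUET_FILE_SIZE=64m;
+insert overwrite partitioned_parquet_table partition (year, month, day)
+  select year, month, day, url, referer, user_agent, http_code, response_time
+  from web_stats;
+</code></pre>
+ <p class="p">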
+ + </p> + + <p class="p"> + <strong class="ph b">Type:</strong> numeric, with optional unit specifier + </p> + + <div class="note important note_important"><span class="note__title importanttitle">Important:</span> + <p class="p"> + Currently, the maximum value for this setting is 1 gigabyte (<code class="ph codeph">1g</code>). + Setting a value higher than 1 gigabyte could result in errors during + an <code class="ph codeph">INSERT</code> operation. + </p> + </div> + + <p class="p"> + <strong class="ph b">Default:</strong> 0 (produces files with a target size of 256 MB; files might be larger for very wide tables) + </p> + + <p class="p"> + <strong class="ph b">Isilon considerations:</strong> + </p> + <div class="p"> + Because the EMC Isilon storage devices use a global value for the block size + rather than a configurable value for each file, the <code class="ph codeph">PARQUET_FILE_SIZE</code> + query option has no effect when Impala inserts data into a table or partition + residing on Isilon storage. Use the <code class="ph codeph">isi</code> command to set the + default block size globally on the Isilon device. For example, to set the + Isilon default block size to 256 MB, the recommended size for Parquet + data files for Impala, issue the following command: +<pre class="pre codeblock"><code>isi hdfs settings modify --default-block-size=256MB</code></pre> + </div> + + <p class="p"> + <strong class="ph b">Related information:</strong> + </p> + + <p class="p"> + For information about the Parquet file format, and how the number and size of data files affects query + performance, see <a class="xref" href="impala_parquet.html#parquet">Using the Parquet File Format with Impala Tables</a>. + </p> + + + + </div> +<nav role="navigation" class="related-links"><div class="familylinks"><div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_query_options.html">Query Options for the SET Statement</a></div></div></nav></article></main></body></html> \ No newline at end of file
