Repository: incubator-impala
Updated Branches:
  refs/heads/doc_prototype 0124ae32f -> e9e4f18cd


Upgrade impala_file_formats.xml to latest version - it contains some new IDs used as conref targets.


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/e9e4f18c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/e9e4f18c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/e9e4f18c

Branch: refs/heads/doc_prototype
Commit: e9e4f18cd4d16a53b7c034a6d003292c50ce1bf4
Parents: 0124ae3
Author: John Russell <[email protected]>
Authored: Mon Oct 31 12:27:08 2016 -0700
Committer: John Russell <[email protected]>
Committed: Mon Oct 31 12:27:08 2016 -0700

----------------------------------------------------------------------
 docs/topics/impala_file_formats.xml | 236 ++++++++++++++++++++++++++++++-
 1 file changed, 235 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/e9e4f18c/docs/topics/impala_file_formats.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_file_formats.xml b/docs/topics/impala_file_formats.xml
index 64bf8a5..48b9e7c 100644
--- a/docs/topics/impala_file_formats.xml
+++ b/docs/topics/impala_file_formats.xml
@@ -29,8 +29,242 @@
       considerations for using each file format with Impala.
     </p>
 
-
+    <p>
+      The file format used for an Impala table has significant performance consequences. Some file formats include
+      compression support that affects the size of data on the disk and, consequently, the amount of I/O and CPU
+      resources required to deserialize data. The amounts of I/O and CPU resources required can be a limiting
+      factor in query performance since querying often begins with moving and decompressing data. To reduce the
+      potential impact of this part of the process, data is often compressed. By compressing data, a smaller total
+      number of bytes are transferred from disk to memory. This reduces the amount of time taken to transfer the
+      data, but a tradeoff occurs when the CPU decompresses the content.
+    </p>
+
+    <p>
+      Impala can query files encoded with most of the popular file formats and compression codecs used in Hadoop.
+      Impala can create and insert data into tables that use some file formats but not others; for file formats
+      that Impala cannot write to, create the table in Hive, issue the <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph>
+      statement in <codeph>impala-shell</codeph>, and query the table through Impala. File formats can be
+      structured, in which case they may include metadata and built-in compression. Supported formats include:
+    </p>
+
+    <table>
+      <title>File Format Support in Impala</title>
+      <tgroup cols="5">
+        <colspec colname="1" colwidth="10*"/>
+        <colspec colname="2" colwidth="10*"/>
+        <colspec colname="3" colwidth="20*"/>
+        <colspec colname="4" colwidth="30*"/>
+        <colspec colname="5" colwidth="30*"/>
+        <thead>
+          <row>
+            <entry>
+              File Type
+            </entry>
+            <entry>
+              Format
+            </entry>
+            <entry>
+              Compression Codecs
+            </entry>
+            <entry>
+              Impala Can CREATE?
+            </entry>
+            <entry>
+              Impala Can INSERT?
+            </entry>
+          </row>
+        </thead>
+        <tbody>
+          <row id="parquet_support">
+            <entry>
+              <xref href="impala_parquet.xml#parquet">Parquet</xref>
+            </entry>
+            <entry>
+              Structured
+            </entry>
+            <entry>
+              Snappy, gzip; currently Snappy by default
+            </entry>
+            <entry>
+              Yes.
+            </entry>
+            <entry>
+              Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
+            </entry>
+          </row>
+          <row id="txtfile_support">
+            <entry>
+              <xref href="impala_txtfile.xml#txtfile">Text</xref>
+            </entry>
+            <entry>
+              Unstructured
+            </entry>
+            <entry rev="2.0.0">
+              LZO, gzip, bzip2, Snappy
+            </entry>
+            <entry>
+              Yes. For <codeph>CREATE TABLE</codeph> with no <codeph>STORED AS</codeph> clause, the default file
+              format is uncompressed text, with values separated by ASCII <codeph>0x01</codeph> characters
+              (typically represented as Ctrl-A).
+            </entry>
+            <entry>
+              Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
+              If LZO compression is used, you must create the table and load data in Hive. If other kinds of
+              compression are used, you must load data through <codeph>LOAD DATA</codeph>, Hive, or manually in
+              HDFS.
+
+<!--            <ph rev="2.0.0">Impala 2.0 and higher can write LZO-compressed text data; for earlier Impala releases, you must create the table and load data in Hive.</ph> -->
+            </entry>
+          </row>
+          <row id="avro_support">
+            <entry>
+              <xref href="impala_avro.xml#avro">Avro</xref>
+            </entry>
+            <entry>
+              Structured
+            </entry>
+            <entry>
+              Snappy, gzip, deflate, bzip2
+            </entry>
+            <entry rev="1.4.0">
+              Yes, in Impala 1.4.0 and higher. Before that, create the table using Hive.
+            </entry>
+            <entry>
+              No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+              <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+            </entry>
+<!--            <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry> -->
+          </row>
+          <row id="rcfile_support">
+            <entry>
+              <xref href="impala_rcfile.xml#rcfile">RCFile</xref>
+            </entry>
+            <entry>
+              Structured
+            </entry>
+            <entry>
+              Snappy, gzip, deflate, bzip2
+            </entry>
+            <entry>
+              Yes.
+            </entry>
+            <entry>
+              No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+              <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+            </entry>
+<!--
+            <entry rev="2.0.0">Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.</entry>
+            -->
+          </row>
+          <row id="sequencefile_support">
+            <entry>
+              <xref href="impala_seqfile.xml#seqfile">SequenceFile</xref>
+            </entry>
+            <entry>
+              Structured
+            </entry>
+            <entry>
+              Snappy, gzip, deflate, bzip2
+            </entry>
+            <entry>Yes.</entry>
+            <entry>
+              No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
+              <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
+            </entry>
+<!--
+            <entry rev="2.0.0">
+              Yes, in Impala 2.0 and higher. For earlier Impala releases, load data through <codeph>LOAD
+              DATA</codeph> on data files already in the right format, or use <codeph>INSERT</codeph> in Hive.
+            </entry>
+-->
+          </row>
+        </tbody>
+      </tgroup>
+    </table>
+
+    <p rev="DOCS-1370">
+      Impala can only query the file formats listed in the preceding table.
+      In particular, Impala does not support the ORC file format.
+    </p>
+
+    <p>
+      Impala supports the following compression codecs:
+    </p>
+
+    <ul>
+      <li rev="2.0.0">
+        Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy
+        compression is very fast, but gzip provides greater space savings. Supported for text files in Impala 2.0
+        and higher.
+<!--        Not supported for text files. -->
+      </li>
+
+      <li rev="2.0.0">
+        Gzip. Recommended when achieving the highest level of compression (and therefore greatest disk-space
+        savings) is desired. Supported for text files in Impala 2.0 and higher.
+      </li>
+
+      <li>
+        Deflate. Not supported for text files.
+      </li>
+
+      <li rev="2.0.0">
+        Bzip2. Supported for text files in Impala 2.0 and higher.
+<!--        Not supported for text files. -->
+      </li>
+
+      <li>
+        <p rev="2.0.0"> LZO, for text files only. Impala can query
+          LZO-compressed text tables, but currently cannot create them or insert
+          data into them; perform these operations in Hive. </p>
+      </li>
+    </ul>
   </conbody>
+  <concept id="file_format_choosing">
+
+    <title>Choosing the File Format for a Table</title>
+    <prolog>
+      <metadata>
+        <data name="Category" value="Planning"/>
+      </metadata>
+    </prolog>
+
+    <conbody>
+
+      <p>
+        Different file formats and compression codecs work better for different data sets. While Impala typically
+        provides performance gains regardless of file format, choosing the proper format for your data can yield
+        further performance improvements. Use the following considerations to decide which combination of file
+        format and compression to use for a particular table:
+      </p>
+
+      <ul>
+        <li>
+          If you are working with existing files that are already in a supported file format, use the same format
+          for the Impala table where practical. If the original format does not yield acceptable query performance
+          or resource usage, consider creating a new Impala table with different file format or compression
+          characteristics, and doing a one-time conversion by copying the data to the new table using the
+          <codeph>INSERT</codeph> statement. Depending on the file format, you might run the
+          <codeph>INSERT</codeph> statement in <codeph>impala-shell</codeph> or in Hive.
+        </li>
+
+        <li>
+          Text files are convenient to produce through many different tools, and are human-readable for ease of
+          verification and debugging. Those characteristics are why text is the default format for an Impala
+          <codeph>CREATE TABLE</codeph> statement. When performance and resource usage are the primary
+          considerations, use one of the other file formats and consider using compression. A typical workflow
+          might involve bringing data into an Impala table by copying CSV or TSV files into the appropriate data
+          directory, and then using the <codeph>INSERT ... SELECT</codeph> syntax to copy the data into a table
+          using a different, more compact file format.
+        </li>
+        <li>
+          If your architecture involves storing data to be queried in memory, do not compress the data. There is no
+          I/O savings since the data does not need to be moved from disk, but there is a CPU cost to decompress the
+          data.
+        </li>
+      </ul>
+    </conbody>
+  </concept>
 </concept>
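
Reader's note (not part of the commit): the "File Format Support in Impala" table added by this diff can be read as a lookup from file format and codec to the place where data has to be written. The Python below is a minimal sketch of that reading. The names FormatSupport, SUPPORT and how_to_load are hypothetical and exist only for illustration, and the returned strings paraphrase the table cells rather than quote any Impala API.

```python
from typing import NamedTuple, Optional, Tuple


class FormatSupport(NamedTuple):
    """One row of the support table (hypothetical structure for illustration)."""
    codecs: Tuple[str, ...]
    impala_can_create: bool
    impala_can_insert: bool


_COMMON = ("snappy", "gzip", "deflate", "bzip2")

# Transcribed from the "File Format Support in Impala" table in the diff.
SUPPORT = {
    "parquet": FormatSupport(("snappy", "gzip"), True, True),
    "text": FormatSupport(("lzo", "gzip", "bzip2", "snappy"), True, True),
    "avro": FormatSupport(_COMMON, True, False),
    "rcfile": FormatSupport(_COMMON, True, False),
    "sequencefile": FormatSupport(_COMMON, True, False),
}


def how_to_load(file_format: str, codec: Optional[str] = None) -> str:
    """Summarize, per the table, where data for such a table is written."""
    fmt = file_format.lower()
    if fmt not in SUPPORT:
        # The diff states that Impala can only query the listed formats.
        raise ValueError(f"{file_format} is not in the support table")
    support = SUPPORT[fmt]
    if codec is not None and codec.lower() not in support.codecs:
        raise ValueError(f"{codec} is not listed for {file_format}")
    if fmt == "text" and codec is not None:
        # Compressed text is the special case called out in the Text row.
        if codec.lower() == "lzo":
            return "create the table and load data in Hive"
        return "load data through LOAD DATA, Hive, or manually in HDFS"
    if support.impala_can_insert:
        return "CREATE TABLE and INSERT in Impala"
    return "LOAD DATA on existing files, or INSERT in Hive followed by REFRESH in Impala"
```

For example, how_to_load("avro") returns the LOAD DATA / Hive INSERT path, while how_to_load("orc") raises ValueError because ORC does not appear in the table.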
