Repository: impala Updated Branches: refs/heads/master 85f3bb017 -> d91bc4402
IMPALA-5937: [DOCS] PARQUET_READ_STATISTICS and PARQUET_DICTIONARY_FILTERING Change-Id: I88fa8c4a64560711251076c50e1695f7f032f9c0 Reviewed-on: http://gerrit.cloudera.org:8080/11355 Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Reviewed-by: Lars Volker <l...@cloudera.com> Project: http://git-wip-us.apache.org/repos/asf/impala/repo Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/c28dd512 Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/c28dd512 Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/c28dd512 Branch: refs/heads/master Commit: c28dd512d35d3d3c4e94fb9769bfbe71cc2face7 Parents: 85f3bb0 Author: Alex Rodoni <arod...@cloudera.com> Authored: Wed Aug 29 16:59:12 2018 -0700 Committer: Alex Rodoni <arod...@cloudera.com> Committed: Tue Sep 11 21:07:32 2018 +0000 ---------------------------------------------------------------------- docs/impala.ditamap | 2 + docs/topics/impala_parquet.xml | 61 ++------- .../impala_parquet_dictionary_filtering.xml | 128 +++++++++++++++++++ docs/topics/impala_parquet_read_statistics.xml | 117 +++++++++++++++++ 4 files changed, 261 insertions(+), 47 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/impala/blob/c28dd512/docs/impala.ditamap ---------------------------------------------------------------------- diff --git a/docs/impala.ditamap b/docs/impala.ditamap index 1c7a69c..ce88c01 100644 --- a/docs/impala.ditamap +++ b/docs/impala.ditamap @@ -211,8 +211,10 @@ under the License. 
<topicref href="topics/impala_parquet_compression_codec.xml"/> <topicref rev="2.6.0 IMPALA-2069" href="topics/impala_parquet_annotate_strings_utf8.xml"/> <topicref rev="2.9.0 IMPALA-4725" href="topics/impala_parquet_array_resolution.xml"/> + <topicref href="topics/impala_parquet_dictionary_filtering.xml"/> <topicref rev="2.6.0 IMPALA-2835" href="topics/impala_parquet_fallback_schema_resolution.xml"/> <topicref href="topics/impala_parquet_file_size.xml"/> + <topicref href="topics/impala_parquet_read_statistics.xml"/> <topicref rev="2.6.0 IMPALA-3286" href="topics/impala_prefetch_mode.xml"/> <topicref href="topics/impala_query_timeout_s.xml"/> <topicref href="topics/impala_request_pool.xml"/> http://git-wip-us.apache.org/repos/asf/impala/blob/c28dd512/docs/topics/impala_parquet.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_parquet.xml b/docs/topics/impala_parquet.xml index 3667560..b5fda78 100644 --- a/docs/topics/impala_parquet.xml +++ b/docs/topics/impala_parquet.xml @@ -35,56 +35,23 @@ under the License. </prolog> <conbody> - <p> - <indexterm audience="hidden">Parquet support in Impala</indexterm> - Impala helps you to create, manage, and query Parquet tables. Parquet is a column-oriented binary file format - intended to be highly efficient for the types of large-scale queries that Impala is best at. Parquet is - especially good for queries scanning particular columns within a table, for example to query <q>wide</q> - tables with many columns, or to perform aggregation operations such as <codeph>SUM()</codeph> and - <codeph>AVG()</codeph> that need to process most or all of the values from a column. Each data file contains - the values for a set of rows (the <q>row group</q>). Within a data file, the values from each column are - organized so that they are all adjacent, enabling good compression for the values from that column. 
Queries - against a Parquet table can retrieve and analyze these values from any column quickly and with minimal I/O. + Impala allows you to create, manage, and query Parquet tables. Parquet + is a column-oriented binary file format intended to be highly efficient + for the types of large-scale queries that Impala is best at. Parquet is + especially good for queries scanning particular columns within a table, + for example to query <q>wide</q> tables with many columns, or to perform + aggregation operations such as <codeph>SUM()</codeph> and + <codeph>AVG()</codeph> that need to process most or all of the values + from a column. Each data file contains the values for a set of rows (the + <q>row group</q>). Within a data file, the values from each column are + organized so that they are all adjacent, enabling good compression for the + values from that column. Queries against a Parquet table can retrieve and + analyze these values from any column quickly and with minimal I/O. </p> - - <table> - <title>Parquet Format Support in Impala</title> - <tgroup cols="5"> - <colspec colname="1" colwidth="10*"/> - <colspec colname="2" colwidth="10*"/> - <colspec colname="3" colwidth="20*"/> - <colspec colname="4" colwidth="30*"/> - <colspec colname="5" colwidth="30*"/> - <thead> - <row> - <entry> - File Type - </entry> - <entry> - Format - </entry> - <entry> - Compression Codecs - </entry> - <entry> - Impala Can CREATE? - </entry> - <entry> - Impala Can INSERT? 
- </entry> - </row> - </thead> - <tbody> - <row conref="impala_file_formats.xml#file_formats/parquet_support"> - <entry/> - </row> - </tbody> - </tgroup> - </table> - + <p>See <xref href="impala_file_formats.xml#file_formats"/> for the summary + of Parquet format support.</p> <p outputclass="toc inpage"/> - </conbody> http://git-wip-us.apache.org/repos/asf/impala/blob/c28dd512/docs/topics/impala_parquet_dictionary_filtering.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_parquet_dictionary_filtering.xml b/docs/topics/impala_parquet_dictionary_filtering.xml new file mode 100644 index 0000000..26a3e8e --- /dev/null +++ b/docs/topics/impala_parquet_dictionary_filtering.xml @@ -0,0 +1,128 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. 
+--> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="parquet_dictionary_filtering" rev="2.9.0 IMPALA-4725"> + + <title>PARQUET_DICTIONARY_FILTERING Query Option (<keyword keyref="impala29"/> or higher only)</title> + + <titlealts audience="PDF"> + + <navtitle>PARQUET_DICTIONARY_FILTERING</navtitle> + + </titlealts> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Parquet"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p rev="parquet_dictionary_filtering"> + The <codeph>PARQUET_DICTIONARY_FILTERING</codeph> query option controls whether Impala + uses dictionary filtering for Parquet files. + </p> + + <p> + To efficiently process a highly selective scan query, when this option is enabled, Impala + checks the values in the Parquet dictionary page and determines whether the whole row group + can be skipped. + </p> + + <p> + A column chunk is purely dictionary encoded and can be used by dictionary filtering if any + of the following conditions is met: + <ol> + <li> + If the <codeph>encoding_stats</codeph> is in the Parquet file, dictionary filtering + uses it to determine if there are only dictionary encoded pages (i.e. there are no + data pages with an encoding other than PLAIN_DICTIONARY). + </li> + + <li> + If the encoding stats are not present, dictionary filtering looks at the encodings. + The column is purely dictionary encoded if both of the following conditions are satisfied: + <ul> + <li> + PLAIN_DICTIONARY is present. + </li> + + <li> + Only PLAIN_DICTIONARY, RLE, or BIT_PACKED encodings are listed. + </li> + </ul> + </li> + + <li> + Dictionary filtering works for Parquet dictionaries with fewer than 40000 values if + the file was written by <keyword + keyref="impala29"> or lower</keyword>.
+ </li> + </ol> + </p> + + <p> + In the query runtime profile output for each Impalad instance, the + <codeph>NumDictFilteredRowGroups</codeph> field in the SCAN node section shows the number + of row groups that were skipped based on dictionary filtering. + </p> + + <p> + Note that row groups can also be filtered out by Parquet statistics. Dictionary filtering + is not considered for row groups that were already filtered out that way. + </p> + + <p> + The supported values for the query option are: + <ul> + <li> + <codeph>true</codeph> (<codeph>1</codeph>): Use dictionary filtering. + </li> + + <li> + <codeph>false</codeph> (<codeph>0</codeph>): Do not use dictionary filtering. + </li> + + <li> + Any other value is treated as <codeph>false</codeph>. + </li> + </ul> + </p> + + <p> + <b>Type:</b> Boolean + </p> + + <p> + <b>Default:</b> <codeph>true</codeph> (<codeph>1</codeph>) + </p> + + <p conref="../shared/impala_common.xml#common/added_in_290"/> + + <p conref="../shared/impala_common.xml#common/example_blurb"/> + + </conbody> + +</concept> http://git-wip-us.apache.org/repos/asf/impala/blob/c28dd512/docs/topics/impala_parquet_read_statistics.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_parquet_read_statistics.xml b/docs/topics/impala_parquet_read_statistics.xml new file mode 100644 index 0000000..632d73a --- /dev/null +++ b/docs/topics/impala_parquet_read_statistics.xml @@ -0,0 +1,117 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License.
You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="parquet_read_statistics"> + + <title>PARQUET_READ_STATISTICS Query Option (<keyword keyref="impala29"/> or higher only)</title> + + <titlealts audience="PDF"> + + <navtitle>PARQUET_READ_STATISTICS</navtitle> + + </titlealts> + + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Query Options"/> + <data name="Category" value="Parquet"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Data Analysts"/> + </metadata> + </prolog> + + <conbody> + + <p> + The <codeph>PARQUET_READ_STATISTICS</codeph> query option controls whether Impala reads + statistics from Parquet files and uses them during query processing. + </p> + + <p> + Parquet files store min/max statistics that can be used to skip row groups whose values + cannot satisfy a given predicate. When this query option is set to <codeph>true</codeph>, + Impala reads the Parquet statistics and skips reading row groups that do not match the + conditions in the <codeph>WHERE</codeph> clause.
+ </p> + + <p> + Impala supports filtering based on Parquet statistics: + </p> + + <ul> + <li> + For the following types with the old version of the statistics: Boolean, Integer, Float + </li> + + <li> + For the following types with the new version of the statistics (starting in Impala 2.8): + Boolean, Integer, Float, Decimal, String, Timestamp + </li> + + <li> + For simple predicates of the forms <codeph>&lt;slot&gt; &lt;op&gt; &lt;constant&gt;</codeph> or + <codeph>&lt;constant&gt; &lt;op&gt; &lt;slot&gt;</codeph>, where <codeph>&lt;op&gt;</codeph> is one of LT, + LE, GE, GT, or EQ + </li> + </ul> + + <p> + Setting the <codeph>PARQUET_READ_STATISTICS</codeph> option to <codeph>false</codeph> provides + a workaround when dealing with files that have corrupt Parquet statistics or unknown errors. + </p> + + <p> + In the query runtime profile output for each Impalad instance, the + <codeph>NumStatsFilteredRowGroups</codeph> field in the SCAN node section shows the number + of row groups that were skipped based on Parquet statistics. + </p> + + <p> + The supported values for the query option are: + <ul> + <li> + <codeph>true</codeph> (<codeph>1</codeph>): Read statistics from Parquet files and use + them in query processing. + </li> + + <li> + <codeph>false</codeph> (<codeph>0</codeph>): Do not read or use Parquet statistics. + </li> + + <li> + Any other value is treated as <codeph>false</codeph>. + </li> + </ul> + </p> + + <p> + <b>Type:</b> Boolean + </p> + + <p> + <b>Default:</b> <codeph>true</codeph> (<codeph>1</codeph>) + </p> + + <p conref="../shared/impala_common.xml#common/added_in_290"/> + + </conbody> + +</concept>
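Illustrative note (not part of the commit): the min/max row-group skipping described in impala_parquet_read_statistics.xml can be sketched roughly as below. This is a toy model in Python, not Impala's implementation. The names RowGroupStats, may_match, and count_skipped are hypothetical, and the statistics are assumed to be plain integers for a single column.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one row group's min/max statistics on a column."""
    min_value: int
    max_value: int


def may_match(stats: RowGroupStats, op: str, constant: int) -> bool:
    """Return False only if no value in [min_value, max_value] can satisfy
    the simple predicate `<slot> <op> <constant>`."""
    if op == "LT":
        return stats.min_value < constant
    if op == "LE":
        return stats.min_value <= constant
    if op == "GT":
        return stats.max_value > constant
    if op == "GE":
        return stats.max_value >= constant
    if op == "EQ":
        return stats.min_value <= constant <= stats.max_value
    raise ValueError(f"unsupported operator: {op}")


def count_skipped(row_groups, op, constant, read_statistics=True):
    """Count row groups a scan could skip; loosely analogous to the
    NumStatsFilteredRowGroups counter. With read_statistics=False
    (the option disabled), nothing is skipped."""
    if not read_statistics:
        return 0
    return sum(1 for rg in row_groups if not may_match(rg, op, constant))


# Three row groups covering disjoint value ranges of one integer column.
row_groups = [RowGroupStats(0, 9), RowGroupStats(10, 19), RowGroupStats(20, 29)]
print(count_skipped(row_groups, "GT", 15))
print(count_skipped(row_groups, "EQ", 25))
print(count_skipped(row_groups, "EQ", 25, read_statistics=False))
```

The check is deliberately conservative: a row group is skipped only when its statistics prove that no row can match, so corrupt statistics would lead to wrong skips, which is why disabling the option serves as a workaround.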