[1/2] impala git commit: IMPALA-5937: [DOCS] PARQUET_READ_STATISTICS and PARQUET_DICTIONARY_FILTERING

arodoni Tue, 11 Sep 2018 15:13:44 -0700

Repository: impala
Updated Branches:
  refs/heads/master 85f3bb017 -> d91bc4402



IMPALA-5937: [DOCS] PARQUET_READ_STATISTICS and PARQUET_DICTIONARY_FILTERING

Change-Id: I88fa8c4a64560711251076c50e1695f7f032f9c0
Reviewed-on: http://gerrit.cloudera.org:8080/11355
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Lars Volker <l...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/impala/commit/c28dd512
Tree: http://git-wip-us.apache.org/repos/asf/impala/tree/c28dd512
Diff: http://git-wip-us.apache.org/repos/asf/impala/diff/c28dd512

Branch: refs/heads/master
Commit: c28dd512d35d3d3c4e94fb9769bfbe71cc2face7
Parents: 85f3bb0
Author: Alex Rodoni <arod...@cloudera.com>
Authored: Wed Aug 29 16:59:12 2018 -0700
Committer: Alex Rodoni <arod...@cloudera.com>
Committed: Tue Sep 11 21:07:32 2018 +0000

----------------------------------------------------------------------
 docs/impala.ditamap                             |   2 +
 docs/topics/impala_parquet.xml                  |  61 ++-------
 .../impala_parquet_dictionary_filtering.xml     | 128 +++++++++++++++++++
 docs/topics/impala_parquet_read_statistics.xml  | 117 +++++++++++++++++
 4 files changed, 261 insertions(+), 47 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/impala/blob/c28dd512/docs/impala.ditamap
----------------------------------------------------------------------
diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index 1c7a69c..ce88c01 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -211,8 +211,10 @@ under the License.
           <topicref href="topics/impala_parquet_compression_codec.xml"/>
           <topicref rev="2.6.0 IMPALA-2069" 
href="topics/impala_parquet_annotate_strings_utf8.xml"/>
           <topicref rev="2.9.0 IMPALA-4725" 
href="topics/impala_parquet_array_resolution.xml"/>
+          <topicref href="topics/impala_parquet_dictionary_filtering.xml"/>
           <topicref rev="2.6.0 IMPALA-2835" 
href="topics/impala_parquet_fallback_schema_resolution.xml"/>
           <topicref href="topics/impala_parquet_file_size.xml"/>
+          <topicref href="topics/impala_parquet_read_statistics.xml"/>
           <topicref rev="2.6.0 IMPALA-3286" 
href="topics/impala_prefetch_mode.xml"/>
           <topicref href="topics/impala_query_timeout_s.xml"/>
           <topicref href="topics/impala_request_pool.xml"/>

http://git-wip-us.apache.org/repos/asf/impala/blob/c28dd512/docs/topics/impala_parquet.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_parquet.xml b/docs/topics/impala_parquet.xml
index 3667560..b5fda78 100644
--- a/docs/topics/impala_parquet.xml
+++ b/docs/topics/impala_parquet.xml
@@ -35,56 +35,23 @@ under the License.
   </prolog>
 
   <conbody>
-
     <p>
-      <indexterm audience="hidden">Parquet support in Impala</indexterm>
-      Impala helps you to create, manage, and query Parquet tables. Parquet is 
a column-oriented binary file format
-      intended to be highly efficient for the types of large-scale queries 
that Impala is best at. Parquet is
-      especially good for queries scanning particular columns within a table, 
for example to query <q>wide</q>
-      tables with many columns, or to perform aggregation operations such as 
<codeph>SUM()</codeph> and
-      <codeph>AVG()</codeph> that need to process most or all of the values 
from a column. Each data file contains
-      the values for a set of rows (the <q>row group</q>). Within a data file, 
the values from each column are
-      organized so that they are all adjacent, enabling good compression for 
the values from that column. Queries
-      against a Parquet table can retrieve and analyze these values from any 
column quickly and with minimal I/O.
+      Impala allows you to create, manage, and query Parquet tables. Parquet
+      is a column-oriented binary file format intended to be highly efficient
+      for the types of large-scale queries that Impala is best at. Parquet is
+      especially good for queries scanning particular columns within a table,
+      for example to query <q>wide</q> tables with many columns, or to perform
+      aggregation operations such as <codeph>SUM()</codeph> and
+        <codeph>AVG()</codeph> that need to process most or all of the values
+      from a column. Each data file contains the values for a set of rows (the
+        <q>row group</q>). Within a data file, the values from each column are
+      organized so that they are all adjacent, enabling good compression for 
the
+      values from that column. Queries against a Parquet table can retrieve and
+      analyze these values from any column quickly and with minimal I/O.
     </p>
-
-    <table>
-      <title>Parquet Format Support in Impala</title>
-      <tgroup cols="5">
-        <colspec colname="1" colwidth="10*"/>
-        <colspec colname="2" colwidth="10*"/>
-        <colspec colname="3" colwidth="20*"/>
-        <colspec colname="4" colwidth="30*"/>
-        <colspec colname="5" colwidth="30*"/>
-        <thead>
-          <row>
-            <entry>
-              File Type
-            </entry>
-            <entry>
-              Format
-            </entry>
-            <entry>
-              Compression Codecs
-            </entry>
-            <entry>
-              Impala Can CREATE?
-            </entry>
-            <entry>
-              Impala Can INSERT?
-            </entry>
-          </row>
-        </thead>
-        <tbody>
-          <row conref="impala_file_formats.xml#file_formats/parquet_support">
-            <entry/>
-          </row>
-        </tbody>
-      </tgroup>
-    </table>
-
+    <p>See <xref href="impala_file_formats.xml#file_formats"/> for the summary
+      of Parquet format support.</p>
     <p outputclass="toc inpage"/>
-
   </conbody>
 
 

http://git-wip-us.apache.org/repos/asf/impala/blob/c28dd512/docs/topics/impala_parquet_dictionary_filtering.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_parquet_dictionary_filtering.xml 
b/docs/topics/impala_parquet_dictionary_filtering.xml
new file mode 100644
index 0000000..26a3e8e
--- /dev/null
+++ b/docs/topics/impala_parquet_dictionary_filtering.xml
@@ -0,0 +1,128 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="parquet_dictionary_filtering" rev="2.9.0 IMPALA-4725">
+
+  <title>PARQUET_DICTIONARY_FILTERING Query Option (<keyword 
keyref="impala29"/> or higher only)</title>
+
+  <titlealts audience="PDF">
+
+    <navtitle>PARQUET_DICTIONARY_FILTERING</navtitle>
+
+  </titlealts>
+
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Query Options"/>
+      <data name="Category" value="Parquet"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p rev="parquet_dictionary_filtering">
+      The <codeph>PARQUET_DICTIONARY_FILTERING</codeph> query option controls 
whether Impala
+      uses dictionary filtering for Parquet files.
+    </p>
+
+    <p>
+      To efficiently process a highly selective scan query, when this option is
+      enabled, Impala checks the values in the Parquet dictionary page and
+      determines whether the whole row group can be skipped.
+    </p>
+
+    <p>
+      A column chunk is purely dictionary encoded and can be used by dictionary
+      filtering if any of the following conditions is met:
+      <ol>
+        <li>
+          If the <codeph>encoding_stats</codeph> is in the Parquet file, 
dictionary filtering
+          uses it to determine if there are only dictionary encoded pages 
(i.e. there are no
+          data pages with an encoding other than PLAIN_DICTIONARY).
+        </li>
+
+        <li>
+          If the encoding stats are not present, dictionary filtering looks at the encodings.
+          The column is purely dictionary encoded if both of the following conditions hold:
+          <ul>
+            <li>
+              PLAIN_DICTIONARY is present.
+            </li>
+
+            <li>
+              Only PLAIN_DICTIONARY, RLE, or BIT_PACKED encodings are listed.
+            </li>
+          </ul>
+        </li>
+
+        <li>
+          Dictionary filtering works for Parquet dictionaries with fewer than 40000 values if
+          the file was written by <keyword
+            keyref="impala29"> or lower</keyword>.
+        </li>
+      </ol>
+    </p>
+
+    <p>
+      In the query runtime profile output for each Impalad instance, the
+      <codeph>NumDictFilteredRowGroups</codeph> field in the SCAN node section 
shows the number
+      of row groups that were skipped based on dictionary filtering.
+    </p>
+
+    <p>
+      Note that row groups can be filtered out by Parquet statistics, and in 
such cases,
+      dictionary filtering will not be considered.
+    </p>
+
+    <p>
+      The supported values for the query option are:
+      <ul>
+        <li>
+          <codeph>true</codeph> (<codeph>1</codeph>): Use dictionary filtering.
+        </li>
+
+        <li>
+          <codeph>false</codeph> (<codeph>0</codeph>): Do not use dictionary filtering.
+        </li>
+
+        <li>
+          Any other values are treated as <codeph>false</codeph>.
+        </li>
+      </ul>
+    </p>
+
+    <p>
+      <b>Type:</b> Boolean
+    </p>
+
+    <p>
+      <b>Default:</b> <codeph>true</codeph> (<codeph>1</codeph>)
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/added_in_290"/>
+
+    <p conref="../shared/impala_common.xml#common/example_blurb"/>
+
+  </conbody>
+
+</concept>
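
The "purely dictionary encoded" check described in impala_parquet_dictionary_filtering.xml above can be sketched roughly as follows. This is only an illustration in Python, not Impala's implementation: the function name, the string encoding names, and the (page_type, encoding, count) tuple shape for an encoding_stats entry are assumptions made for the example. The 40000-value limit for files written by older Impala versions is not modeled.

```python
from typing import Iterable, Optional, Tuple

PLAIN_DICTIONARY = "PLAIN_DICTIONARY"
# Encodings that may be listed alongside PLAIN_DICTIONARY when no
# encoding_stats are available (second condition in the topic).
ALLOWED_WITHOUT_STATS = {PLAIN_DICTIONARY, "RLE", "BIT_PACKED"}
# Page types treated as data pages in this sketch.
DATA_PAGE_TYPES = {"DATA_PAGE", "DATA_PAGE_V2"}

# Hypothetical shape of one encoding_stats entry: (page_type, encoding, count).
EncodingStat = Tuple[str, str, int]


def is_purely_dictionary_encoded(
    encodings: Iterable[str],
    encoding_stats: Optional[Iterable[EncodingStat]] = None,
) -> bool:
    """Return True if the column chunk looks purely dictionary encoded."""
    if encoding_stats is not None:
        # Condition 1: no data page uses an encoding other than
        # PLAIN_DICTIONARY. Dictionary pages themselves are ignored.
        return all(
            encoding == PLAIN_DICTIONARY
            for page_type, encoding, count in encoding_stats
            if page_type in DATA_PAGE_TYPES and count > 0
        )
    # Condition 2: PLAIN_DICTIONARY is present, and nothing outside
    # PLAIN_DICTIONARY, RLE, and BIT_PACKED is listed.
    listed = set(encodings)
    return PLAIN_DICTIONARY in listed and listed <= ALLOWED_WITHOUT_STATS
```

When encoding_stats is supplied, the sketch relies on it alone and ignores the plain encodings list, which mirrors the order of the two conditions in the topic.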

http://git-wip-us.apache.org/repos/asf/impala/blob/c28dd512/docs/topics/impala_parquet_read_statistics.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_parquet_read_statistics.xml 
b/docs/topics/impala_parquet_read_statistics.xml
new file mode 100644
index 0000000..632d73a
--- /dev/null
+++ b/docs/topics/impala_parquet_read_statistics.xml
@@ -0,0 +1,117 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
+<concept id="parquet_read_statistics">
+
+  <title>PARQUET_READ_STATISTICS Query Option (<keyword keyref="impala29"/> or 
higher only)</title>
+
+  <titlealts audience="PDF">
+
+    <navtitle>PARQUET_READ_STATISTICS</navtitle>
+
+  </titlealts>
+
+  <prolog>
+    <metadata>
+      <data name="Category" value="Impala"/>
+      <data name="Category" value="Impala Query Options"/>
+      <data name="Category" value="Parquet"/>
+      <data name="Category" value="Developers"/>
+      <data name="Category" value="Data Analysts"/>
+    </metadata>
+  </prolog>
+
+  <conbody>
+
+    <p>
+      The <codeph>PARQUET_READ_STATISTICS</codeph> query option controls 
whether to read
+      statistics from Parquet files and use them during query processing.
+    </p>
+
+    <p>
+      Parquet stores min/max statistics that can be used to skip reading row groups
+      that cannot satisfy a given predicate. When this query option is set to <codeph>true</codeph>,
+      Impala reads the Parquet statistics and skips reading row groups that do not match the
+      conditions in the <codeph>WHERE</codeph> clause.
+    </p>
+
+    <p>
+      Impala supports filtering based on Parquet statistics:
+    </p>
+
+    <ul>
+      <li>
+        Of the numerical types for the old version of the statistics: Boolean, 
Integer, Float
+      </li>
+
+      <li>
+        Of the types for the new version of the statistics (starting in Impala 2.8): Boolean,
+        Integer, Float, Decimal, String, Timestamp
+      </li>
+
+      <li>
+        For simple predicates of the forms: <codeph>&lt;slot> &lt;op> &lt;constant></codeph> or
+        <codeph>&lt;constant> &lt;op> &lt;slot></codeph>, where <codeph>&lt;op></codeph> is one of
+        LT, LE, GE, GT, or EQ
+      </li>
+    </ul>
+
+    <p>
+      Setting the <codeph>PARQUET_READ_STATISTICS</codeph> option to <codeph>false</codeph> provides a
+      workaround for files that have corrupt Parquet statistics or that cause unknown errors.
+    </p>
+
+    <p>
+      In the query runtime profile output for each Impalad instance, the
+      <codeph>NumStatsFilteredRowGroups</codeph> field in the SCAN node 
section shows the number
+      of row groups that were skipped based on Parquet statistics.
+    </p>
+
+    <p>
+      The supported values for the query option are:
+      <ul>
+        <li>
+          <codeph>true</codeph> (<codeph>1</codeph>): Read statistics from 
Parquet files and use
+          them in query processing.
+        </li>
+
+        <li>
+          <codeph>false</codeph> (<codeph>0</codeph>): Do not read statistics from Parquet files.
+        </li>
+
+        <li>
+          Any other values are treated as <codeph>false</codeph>.
+        </li>
+      </ul>
+    </p>
+
+    <p>
+      <b>Type:</b> Boolean
+    </p>
+
+    <p>
+      <b>Default:</b> <codeph>true</codeph>
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/added_in_290"/>
+
+  </conbody>
+
+</concept>
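
The min/max row-group skipping described in impala_parquet_read_statistics.xml above can be illustrated with a small Python sketch. It assumes a predicate of the form <slot> <op> <constant> and a row group whose statistics give an inclusive [min, max] range for the column. The function name and signature are hypothetical and are not taken from Impala.

```python
def can_skip_row_group(min_val, max_val, op, constant):
    """Return True if no value in [min_val, max_val] can satisfy
    the predicate `slot <op> constant`, so the row group can be skipped."""
    if op == "LT":   # slot < constant: impossible if every value >= constant
        return min_val >= constant
    if op == "LE":   # slot <= constant: impossible if every value > constant
        return min_val > constant
    if op == "GT":   # slot > constant: impossible if every value <= constant
        return max_val <= constant
    if op == "GE":   # slot >= constant: impossible if every value < constant
        return max_val < constant
    if op == "EQ":   # slot = constant: impossible if constant is out of range
        return constant < min_val or constant > max_val
    raise ValueError(f"unsupported operator: {op}")
```

For the <constant> <op> <slot> form, the same function applies after swapping the operands and mirroring the operator (LT becomes GT, LE becomes GE, and so on). A result of False only means the row group might contain matching rows, so it still has to be read.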
