IMPALA-3909: [DOCS] Add general info about Parquet min/max optimization

Just putting an initial stake in the ground. If examples,
details of Hive interoperability, or type-by-type details
are needed, I prefer to handle those in followup gerrits.

Change-Id: I5fd5f7b157024f6089af7feffcb538c160bb130d
Reviewed-on: http://gerrit.cloudera.org:8080/7068
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins


Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/67b63f37
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/67b63f37
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/67b63f37

Branch: refs/heads/master
Commit: 67b63f37e6f99086b64415bc9b0b99bdbbae92d5
Parents: 27b0a5e
Author: John Russell <[email protected]>
Authored: Fri Jun 2 15:59:47 2017 -0700
Committer: Impala Public Jenkins <[email protected]>
Committed: Mon Jun 26 18:05:05 2017 +0000

----------------------------------------------------------------------
 docs/topics/impala_parquet.xml | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/67b63f37/docs/topics/impala_parquet.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_parquet.xml b/docs/topics/impala_parquet.xml
index fee3180..af6ab73 100644
--- a/docs/topics/impala_parquet.xml
+++ b/docs/topics/impala_parquet.xml
@@ -357,6 +357,25 @@ Add an example here.
 
       <p conref="../shared/impala_common.xml#common/s3_block_splitting"/>
 
+      <p rev="IMPALA-3909">
+        In <keyword keyref="impala29"/> and higher, Parquet files written by Impala include
+        embedded metadata specifying the minimum and maximum values for each column, within
+        each row group and each data page within the row group. Impala-written Parquet files
+        typically contain a single row group; a row group can contain many data pages.
+        Impala uses this information (currently, only the metadata for each row group)
+        when reading each Parquet data file during a query, to quickly determine whether each
+        row group within the file potentially includes any rows that match the conditions in the
+        <codeph>WHERE</codeph> clause. For example, if the column <codeph>X</codeph> within
+        a particular Parquet file has a minimum value of 1 and a maximum value of 100, then
+        a query including the clause <codeph>WHERE x &gt; 200</codeph> can quickly determine
+        that it is safe to skip that particular file, instead of scanning all the associated
+        column values. This optimization technique is especially effective for tables that
+        use the <codeph>SORT BY</codeph> clause for the columns most frequently checked in
+        <codeph>WHERE</codeph> clauses, because any <codeph>INSERT</codeph> operation on
+        such tables produces Parquet data files with relatively narrow ranges of column values
+        within each file.
+      </p>
+
     </conbody>
 
     <concept id="parquet_partitioning">
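
For readers of this change, the skipping rule described in the new paragraph can be illustrated with a minimal sketch. This is not Impala's implementation. The RowGroupStats class and both function names are hypothetical, and the sketch assumes a single integer column with a predicate of the form "col > threshold".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column's min/max metadata in a row group."""
    min_value: int
    max_value: int


def may_contain_greater_than(stats: RowGroupStats, threshold: int) -> bool:
    """Return True if some value in the row group could satisfy col > threshold."""
    # If even the largest value does not exceed the threshold, no row can match.
    return stats.max_value > threshold


def row_groups_to_scan(row_groups: List[RowGroupStats], threshold: int) -> List[int]:
    """Return the indexes of row groups that cannot be ruled out by min/max alone."""
    return [
        index
        for index, stats in enumerate(row_groups)
        if may_contain_greater_than(stats, threshold)
    ]


# The first group mirrors the example in the paragraph: min 1, max 100, WHERE x > 200.
groups = [RowGroupStats(1, 100), RowGroupStats(150, 250), RowGroupStats(201, 300)]
selected = row_groups_to_scan(groups, 200)
print(selected)
```

The check is conservative: a row group whose range overlaps the predicate still has to be scanned, because min/max metadata only proves that matching rows are absent, never that they are present. This is also why narrow value ranges per file, as produced by SORT BY, let more row groups be skipped.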
