IMPALA-3909: [DOCS] Add general info about Parquet min/max optimization

Just putting an initial stake in the ground. If examples, details of Hive interoperability, or type-by-type details are needed, I prefer to handle those in followup gerrits.
Change-Id: I5fd5f7b157024f6089af7feffcb538c160bb130d
Reviewed-on: http://gerrit.cloudera.org:8080/7068
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins

Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/67b63f37
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/67b63f37
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/67b63f37

Branch: refs/heads/master
Commit: 67b63f37e6f99086b64415bc9b0b99bdbbae92d5
Parents: 27b0a5e
Author: John Russell <[email protected]>
Authored: Fri Jun 2 15:59:47 2017 -0700
Committer: Impala Public Jenkins <[email protected]>
Committed: Mon Jun 26 18:05:05 2017 +0000

----------------------------------------------------------------------
 docs/topics/impala_parquet.xml | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/67b63f37/docs/topics/impala_parquet.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_parquet.xml b/docs/topics/impala_parquet.xml
index fee3180..af6ab73 100644
--- a/docs/topics/impala_parquet.xml
+++ b/docs/topics/impala_parquet.xml
@@ -357,6 +357,25 @@ Add an example here.
     <p conref="../shared/impala_common.xml#common/s3_block_splitting"/>
+    <p rev="IMPALA-3909">
+      In <keyword keyref="impala29"/> and higher, Parquet files written by Impala include
+      embedded metadata specifying the minimum and maximum values for each column, within
+      each row group and each data page within the row group. Impala-written Parquet files
+      typically contain a single row group; a row group can contain many data pages.
+      Impala uses this information (currently, only the metadata for each row group)
+      when reading each Parquet data file during a query, to quickly determine whether each
+      row group within the file potentially includes any rows that match the conditions in the
+      <codeph>WHERE</codeph> clause. For example, if the column <codeph>X</codeph> within
+      a particular Parquet file has a minimum value of 1 and a maximum value of 100, then
+      a query including the clause <codeph>WHERE x > 200</codeph> can quickly determine
+      that it is safe to skip that particular file, instead of scanning all the associated
+      column values. This optimization technique is especially effective for tables that
+      use the <codeph>SORT BY</codeph> clause for the columns most frequently checked in
+      <codeph>WHERE</codeph> clauses, because any <codeph>INSERT</codeph> operation on
+      such tables produces Parquet data files with relatively narrow ranges of column values
+      within each file.
+    </p>
+
   </conbody>
   <concept id="parquet_partitioning">
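
For readers of this change, the row-group check described in the added paragraph can be illustrated with a minimal Python sketch. This is not Impala's implementation and not part of the commit: the `RowGroupStats` class and both helper functions are hypothetical names invented for illustration, and the sketch assumes a single integer column with a `col > threshold` predicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column's min/max metadata in a row group."""
    min_value: int
    max_value: int


def may_match_greater_than(stats: RowGroupStats, threshold: int) -> bool:
    """Return True if some row in the row group could satisfy `col > threshold`.

    If even the maximum value is not above the threshold, no row can match,
    so the whole row group can be skipped without reading its column values.
    """
    return stats.max_value > threshold


def row_groups_to_scan(all_stats: List[RowGroupStats], threshold: int) -> List[int]:
    """Return the indexes of the row groups that still have to be scanned."""
    return [i for i, s in enumerate(all_stats) if may_match_greater_than(s, threshold)]


# Illustrative data only: three row groups with narrow, non-overlapping ranges,
# similar to what sorted inserts would tend to produce.
stats = [RowGroupStats(1, 100), RowGroupStats(150, 250), RowGroupStats(300, 400)]
selected = row_groups_to_scan(stats, 200)
print(selected)
```

With the paragraph's example of a minimum of 1 and a maximum of 100, the check for `x > 200` rejects the first row group. The narrower and less overlapping the ranges are, the more row groups a selective predicate can rule out, which is why sorted data benefits most.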
