This is an automated email from the ASF dual-hosted git repository.
bridgetb pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/drill.git
The following commit(s) were added to refs/heads/gh-pages by this push:
new f520885 DRILL-6744
f520885 is described below
commit f5208854c534ecba1490aea23f615b072011fea1
Author: Bridget Bevens <[email protected]>
AuthorDate: Fri Dec 14 13:00:12 2018 -0800
DRILL-6744
---
.../026-parquet-filter-pushdown.md | 51 ++++++++++++++++++++--
1 file changed, 47 insertions(+), 4 deletions(-)
diff --git a/_docs/performance-tuning/026-parquet-filter-pushdown.md
b/_docs/performance-tuning/026-parquet-filter-pushdown.md
index beebe78..a1c5416 100644
--- a/_docs/performance-tuning/026-parquet-filter-pushdown.md
+++ b/_docs/performance-tuning/026-parquet-filter-pushdown.md
@@ -1,6 +1,6 @@
---
title: "Parquet Filter Pushdown"
-date: 2018-09-28 21:35:21 UTC
+date: 2018-12-14
parent: "Performance Tuning"
---
@@ -23,10 +23,53 @@ Parquet filter pushdown is similar to partition pruning in
that it reduces the a
The query planner looks at the minimum and maximum values in each row group
for an intersection. If no intersection exists, the planner can prune the row
group in the table. If the minimum and maximum value range is too large, Drill
does not apply Parquet filter pushdown. The query planner can typically prune
more data when the tables in the Parquet file are sorted by row groups.
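The planner's min/max intersection check described above can be sketched in a few lines of Python (an illustrative model only, not Drill's planner code; the function name is invented):

```python
# Illustrative sketch of the row-group pruning decision: a row group can
# be skipped when its [min, max] statistics do not intersect the filter
# range. (Not Drill source code; names are invented for illustration.)

def can_prune(group_min, group_max, filter_min, filter_max):
    # No overlap between the two ranges means no row in this group
    # can possibly satisfy the filter, so the group is pruned.
    return group_max < filter_min or group_min > filter_max

# Row group covering values 100..200 against WHERE col BETWEEN 300 AND 400:
print(can_prune(100, 200, 300, 400))   # → True: the group is skipped
print(can_prune(100, 200, 150, 160))   # → False: the ranges intersect
```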
+##Parquet Filter Pushdown for VARCHAR and DECIMAL Data Types
+Starting in Drill 1.15, Drill supports Parquet filter pushdown for the VARCHAR
and DECIMAL data types. Drill uses binary statistics in the Parquet file or
Drill metadata file to push filters on VARCHAR and DECIMAL data types down to
the data source.
+
+###Parquet Generated Files
+By default, Parquet filter pushdown works for VARCHAR and DECIMAL data types
if the Parquet files were created with Parquet version 1.10.0 or later. Drill
1.13 and later use Parquet 1.10.0 to write and read back Parquet files.
+
+If Parquet files were created with a pre-1.10.0 version of Parquet, and the
data in the binary columns is in ASCII format (not UTF-8), enable the
`store.parquet.reader.strings_signed_min_max` option, which allows Drill to use
binary statistics in older Parquet files.
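The ASCII restriction exists because older Parquet writers computed binary min/max statistics using signed byte comparison. For pure ASCII (all bytes below 0x80), signed and unsigned byte order agree, but multi-byte UTF-8 characters contain bytes at or above 0x80 that a signed comparison treats as negative, so the recorded min/max can be wrong. A minimal Python sketch of the difference (illustrative only, not Drill or parquet-mr code):

```python
# Illustrative: why signed byte comparison is safe for ASCII min/max
# statistics but not for general UTF-8 data.

def signed_key(data: bytes):
    # Model a comparison that treats each byte as a signed value,
    # as older Parquet writers did for binary min/max statistics.
    return [b - 256 if b >= 128 else b for b in data]

ascii_vals = [b"apple", b"pear", b"Zebra"]
utf8_vals = [b"apple", "\u00e9clair".encode(), b"pear"]

# Pure ASCII (all bytes < 0x80): signed and unsigned order agree.
assert min(ascii_vals, key=signed_key) == min(ascii_vals)  # both b"Zebra"

# UTF-8 with a multi-byte character (0xC3 0xA9 for "é"): the signed
# comparison treats 0xC3 as negative, so the reported minimum is wrong.
assert min(utf8_vals, key=signed_key) != min(utf8_vals)
```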
+
+**Note:** DECIMAL filter pushdown only works for Parquet files created by
Parquet 1.10.0 or later due to issue
[PARQUET-1322](https://issues.apache.org/jira/browse/PARQUET-1322).
+
+###Parquet Files Created by Hive
+In Hive 2.3, Parquet files are created by a pre-1.10.0 version of Parquet. If
the data in the binary columns is in ASCII format, you can enable the
`store.parquet.reader.strings_signed_min_max` option to enable pushdown support
for VARCHAR data types. DECIMAL filter pushdown is not supported.
+
+###Drill Generated Metadata Files
+Parquet filter pushdown for DECIMAL and VARCHAR data types may not work
correctly on Drill metadata files that were generated prior to Drill 1.15.
Regenerate all Drill metadata files using Drill 1.15 or later to ensure that
Parquet filter pushdown works correctly on Drill generated metadata files.
+
+If the `store.parquet.reader.strings_signed_min_max` option is not enabled
during regeneration, the minimum and maximum values for the binary data will
not be written. When the binary data is in ASCII format, enabling the
`store.parquet.reader.strings_signed_min_max` option during regeneration
ensures that the minimum and maximum values are written and thus read back and
used during filter pushdown.
+
+###Enabling Statistics Use for Pre-1.10.0 Parquet Files
+If Parquet files were created with a pre-1.10.0 version of Parquet, and the
data in binary columns is in ASCII format (not UTF-8), you can enable Drill to
use the statistics for Parquet filter pushdown on VARCHAR and DECIMAL data
types.
+
+You can use either of the following methods to enable this functionality in
Drill:
+
+- In the `parquet` format plugin configuration, add the
`enableStringsSignedMinMax` option, and set the option to `true`, as shown:
+
+ "parquet" : {
+ type: "parquet",
+ enableStringsSignedMinMax: true
+ }
+
+    This configuration applies to all Parquet files accessed through the
storage plugin that contains this `parquet` format configuration, including
all of its configured workspaces.
+
+
+- From the command line, enable the
`store.parquet.reader.strings_signed_min_max` option at the session or system
level, as shown:
+
+ SET `store.parquet.reader.strings_signed_min_max`='true';
ALTER SYSTEM SET `store.parquet.reader.strings_signed_min_max`='true';
+**Note:**
+    - The `store.parquet.reader.strings_signed_min_max` option accepts three
values: `'true'`, `'false'`, and `''` (empty string). By default, the value is
an empty string.
+    - Setting this option at the system level applies to all Parquet files
in the system. Alternatively, you can set this option in the Drill Web UI.
Options set in the Drill Web UI apply at the system level.
+    - When set at the session level, this option takes precedence over the
setting in the `parquet` format plugin and overrides the system-level setting.
+
+
##Using Parquet Filter Pushdown
Currently, Parquet filter pushdown only supports filters that reference
columns from a single table (local filters). Parquet filter pushdown requires
the minimum and maximum values in the Parquet file metadata. All Parquet files
created in Drill using the CTAS statement contain the necessary metadata. If
your Parquet files were created using another tool, you may need to use Drill
to read and rewrite the files using the [CTAS
command]({{site.baseurl}}/docs/create-table-as-ctas/).
-Parquet filter pushdown works best if you presort the data. You do not have to
sort the entire data set at once. You can sort a subset of the data set, sort
another subset, and so on.
+Parquet filter pushdown works best if you presort the data. You do not have to
sort the entire data set at once. You can sort a subset of the data set, sort
another subset, and so on.
###Configuring Parquet Filter Pushdown
Use the [ALTER SYSTEM|SESSION SET]({{site.baseurl}}/docs/alter-system/)
command with the Parquet filter pushdown options to enable or disable the
feature, and set the number of row groups for a table.
@@ -63,14 +106,14 @@ This query performs a join on two tables partitioned by
the “month” column.
##Support
The following table lists the supported and unsupported clauses, operators,
data types, functions, and scenarios for Parquet filter pushdown:
-**Note:** <sup>1</sup> indicates support as of Drill 1.13. <sup>2</sup>
indicates support as of Drill 1.14.
+**Note:** <sup>1</sup> indicates support as of Drill 1.13. <sup>2</sup>
indicates support as of Drill 1.14. <sup>3</sup> indicates support as of Drill
1.15.
|                      | Supported | Not Supported |
|----------------------|-----------|---------------|
| Clauses | WHERE, <sup>1</sup>WITH, HAVING (HAVING is supported if Drill can pass the filter through GROUP BY.) | - |
| Operators | <sup>2</sup>BETWEEN, <sup>2</sup>ITEM, AND, OR, NOT, <sup>1</sup>IS [NOT] NULL, <sup>1</sup>IS [NOT] TRUE, <sup>1</sup>IS [NOT] FALSE, IN (An IN list is converted to OR if the number of values in the IN list is within a certain threshold, for example 20. If greater than the threshold, pruning cannot occur.) | - |
| Comparison Operators | <>, <, >, <=, >=, = | - |
-| Data Types | INT, BIGINT, FLOAT, DOUBLE, DATE, TIMESTAMP, TIME, <sup>1</sup>BOOLEAN (true, false) | CHAR, VARCHAR columns, Hive TIMESTAMP |
+| Data Types | INT, BIGINT, FLOAT, DOUBLE, DATE, TIMESTAMP, TIME, <sup>1</sup>BOOLEAN (true, false), <sup>3</sup>VARCHAR columns | CHAR, Hive TIMESTAMP |
| Function | CAST is supported among the following types only: int, bigint, float, double, <sup>1</sup>date, <sup>1</sup>timestamp, and <sup>1</sup>time | - |
| Other | <sup>2</sup>Enabled native Hive reader, Files with multiple row groups, <sup>2</sup>Joins | - |