[
https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721839#comment-16721839
]
Bridget Bevens commented on DRILL-6744:
---------------------------------------
Hi [~arina],
Thank you for reviewing and providing feedback. I made changes accordingly and
posted the content here:
https://drill.apache.org/docs/parquet-filter-pushdown/#parquet-filter-pushdown-for-varchar-and-decimal-data-types
Best,
Bridget
> Support filter push down for varchar / decimal data types
> ---------------------------------------------------------
>
> Key: DRILL-6744
> URL: https://issues.apache.org/jira/browse/DRILL-6744
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.14.0
> Reporter: Arina Ielchiieva
> Assignee: Arina Ielchiieva
> Priority: Major
> Labels: doc-complete, ready-to-commit
> Fix For: 1.15.0
>
>
> Now that Drill uses Apache Parquet 1.10.0, where the issue with incorrectly
> stored varchar / decimal min / max statistics is resolved, we should add
> support for varchar / decimal filter push down. Only files created with
> parquet lib 1.9.1 (1.10.0) and later will be subject to push down. If the
> user knows that previously created files have correct min / max statistics
> (i.e. the user knows for sure that the data in binary columns is ASCII, not
> UTF-8), then parquet.strings.signed-min-max.enabled can be set to true to
> enable filter push down.
> *Description*
> _Note: Drill has been using the Parquet 1.10.0 library since Drill 1.13.0._
> *Varchar Partition Pruning*
> Varchar pruning works for files generated both before and after Parquet
> 1.10.0, because partition pruning requires the min and max values to be
> equal, and statistics for binary data are not stored incorrectly when the
> min and max values are the same. Partition pruning using Drill metadata
> files also works, no matter when the metadata file was created (before or
> after Drill 1.15.0).
> Partition pruning won't work for files where the partition value is null,
> due to PARQUET-1341; this issue will be fixed in Parquet 1.11.0.
> *Varchar Filter Push Down*
> Varchar filter push down works for parquet files created with Parquet
> 1.10.0 and later.
> There are two ways to enable push down for files generated with earlier
> Parquet versions, when the user knows for sure that the binary data is
> ASCII (not UTF-8):
> 1. Set the configuration property {{enableStringsSignedMinMax}} to true
> (false by default) in the parquet format plugin:
> {noformat}
> "parquet" : {
> type: "parquet",
> enableStringsSignedMinMax: true
> }
> {noformat}
> This applies to all parquet files of the given file plugin, across all of
> its workspaces.
> 2. To enable / disable reading binary statistics from old parquet files per
> session, the session option {{store.parquet.reader.strings_signed_min_max}}
> can be used. By default it has an empty string value. Setting this option
> takes priority over the config in the parquet format plugin. The option
> allows three values: 'true', 'false', and '' (empty string).
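> For example, a minimal sketch of setting the option for the current session
> (the option is string-typed, so the value must be quoted):
> {noformat}
> ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = 'true';
> {noformat}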
> _Note: store.parquet.reader.strings_signed_min_max can also be set at the
> system level, in which case it applies to all parquet files in the system._
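> A sketch of the same option applied at the system level, so that it affects
> all sessions:
> {noformat}
> ALTER SYSTEM SET `store.parquet.reader.strings_signed_min_max` = 'true';
> {noformat}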
> The same config / session option also applies to reading binary statistics
> from Drill metadata files generated prior to Drill 1.15.0. If the Drill
> metadata file was created prior to Drill 1.15.0, even for parquet files
> created with Parquet library 1.10.0 and later, the user has to enable the
> config / session option or regenerate the Drill metadata file with Drill
> 1.15.0 or later, because the metadata file alone does not tell us whether
> the statistics is stored correctly (earlier Drill versions were writing and
> reading binary statistics by default, though did not use it).
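> A sketch of regenerating the metadata file with Drill 1.15.0 or later; the
> table path below is hypothetical:
> {noformat}
> REFRESH TABLE METADATA dfs.`/path/to/parquet_table`;
> {noformat}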
> When creating a Drill metadata file with Drill 1.15.0 or later for old
> parquet files, the user should mind the config / session option. If
> strings_signed_min_max is enabled, Drill stores binary statistics in the
> Drill metadata file, and because the metadata file was created with Drill
> 1.15.0 or later, Drill reads it back regardless of the option (assuming
> that if statistics is present in the Drill metadata file, it is correct).
> If the user mistakenly enabled strings_signed_min_max, they need to disable
> it and regenerate the Drill metadata file. The same holds the other way
> around: if the user created the metadata file while strings_signed_min_max
> was disabled, no min / max values for binary statistics are written and
> thus none are read back, even if strings_signed_min_max is enabled when the
> metadata is read.
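> As a sketch, the correction sequence described above could look like this
> (hypothetical table path):
> {noformat}
> ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = 'false';
> REFRESH TABLE METADATA dfs.`/path/to/parquet_table`;
> {noformat}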
> *Decimal Partition Pruning*
> Decimal values can be represented in four logical types: int_32, int_64,
> fixed_len_byte_array and binary.
> Partition pruning works for all logical types for both old and new decimal
> files, i.e. those created before and after Parquet 1.10.0. Partition
> pruning won't work for files with a null partition value due to
> PARQUET-1341, which will be fixed in Parquet 1.11.0.
> Partition pruning with a Drill metadata file works for old and new decimal
> files regardless of which Drill version created the metadata file.
> *Decimal Filter Push Down*
> For int_32 / int_64 decimals, push down works only for new files (i.e.
> generated by Parquet 1.10.0 and later); for old files push down won't work
> due to PARQUET-1322.
> For old int_32 / int_64 decimal files, push down works with an old Drill
> metadata file, i.e. one created prior to Drill 1.14.0; for a Drill metadata
> file generated with Drill 1.14.0 or later, push down won't work because it
> was generated after the upgrade to Parquet 1.10.0 (due to PARQUET-1322).
> For new int_32 / int_64 decimal files, push down works with a new Drill
> metadata file.
> For old fixed_len_byte_array / binary decimal files generated prior to
> Parquet 1.10.0, filter push down won't work. Push down with an old Drill
> metadata file works only if the strings_signed_min_max config / session
> option is set to true. Push down with a new Drill metadata file won't work.
> For new fixed_len_byte_array / binary files, filter push down works with
> and without a metadata file (provided the Drill metadata file was generated
> by Drill 1.15.0 or later). If the Drill metadata file was generated prior
> to Drill 1.15.0, to enable reading such statistics the user needs to enable
> the strings_signed_min_max config / session option or re-generate the Drill
> metadata file.
> *Hive Varchar Filter Push Down using Drill native reader*
> Hive 2.3 parquet files are generated with a Parquet library version prior
> to 1.10.0, where statistics for binary UTF-8 data can be stored
> incorrectly. If the user knows for sure that the data in the binary columns
> is ASCII (not UTF-8), the session option
> store.parquet.reader.strings_signed_min_max can be set to 'true' to enable
> varchar filter push down, as sketched below.
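> A sketch of the session setup; the native-reader option name is an
> assumption here and may differ by Drill version:
> {noformat}
> -- assumed option name for the Drill native Parquet reader for Hive tables
> ALTER SESSION SET `store.hive.optimize_scan_with_native_readers` = true;
> -- only safe when the binary columns contain ASCII data
> ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = 'true';
> {noformat}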
> *Hive Decimal Filter Push Down using Drill native reader*
> Hive 2.3 parquet files are generated with a Parquet library version prior
> to 1.10.0; decimal statistics for such files is not available, so push down
> won't work with Hive parquet decimal files.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)