[ https://issues.apache.org/jira/browse/SPARK-34285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274458#comment-17274458 ]
Attila Zsolt Piros commented on SPARK-34285:
--------------------------------------------
[~Xudingyu] predicate pushdown is most useful when a whole row group can be
skipped.
To support this, Parquet stores statistics for each column chunk of a row
group, including the min and max values.
For "StringStartsWith", deciding whether to skip a row group is easy (let's
say the min is "BBB" and the max is "EEE" in the current row group):
- when the pattern is after the max (e.g. "F.*") or
- when the pattern is before the min (e.g. "A.*")
you can safely skip the whole row group.
For "StringEndsWith" and "StringContains", you cannot make any decision based
on the min and max values: they order values by their leading characters and
say nothing about a suffix or an inner substring.
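
As a rough illustration of the min/max reasoning above, here is a minimal
sketch in plain Java. It is not Spark's or Parquet's actual code: the class
PrefixPruning and its methods are hypothetical names, and it compares Java
strings instead of the unsigned byte comparison a real Parquet reader would
use (the two agree for ASCII values such as the ones in this example).

```java
// Hypothetical helper for illustration only; not part of Spark or Parquet.
final class PrefixPruning {

    // Keep at most the first n characters of s.
    static String truncate(String s, int n) {
        return s.substring(0, Math.min(n, s.length()));
    }

    // Returns true when no value in [min, max] can start with the prefix,
    // so the whole group can be skipped without reading it.
    static boolean canDropGroup(String min, String max, String prefix) {
        int n = prefix.length();
        // The prefix sorts after every value in the group ...
        boolean prefixAfterMax = truncate(max, n).compareTo(prefix) < 0;
        // ... or before every value in the group.
        boolean prefixBeforeMin = truncate(min, n).compareTo(prefix) > 0;
        return prefixAfterMax || prefixBeforeMin;
    }
}
```

With min "BBB" and max "EEE", canDropGroup returns true for the prefixes "F"
and "A" and false for "C", because a value such as "CCC" may be present. No
comparable check can be written for a suffix or a substring: a group with
the same min and max could contain "CXA", which ends with "A" and contains
"X", and neither fact is visible in the statistics.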
> Implement Parquet StringEndsWith、StringContains Filter
> ------------------------------------------------------
>
> Key: SPARK-34285
> URL: https://issues.apache.org/jira/browse/SPARK-34285
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Xudingyu
> Priority: Major
>
> When creating parquetFilters, Spark currently only implements
> {code:java}
> case sources.StringStartsWith(name, prefix)
> {code}
> But StringEndsWith and StringContains also exist in
> /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala
> We can implement these two filters, and rename
> {code:java}
> PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED
> {code}
> to
> {code:java}
> PARQUET_FILTER_PUSHDOWN_STRING_ENABLED
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)