[
https://issues.apache.org/jira/browse/PARQUET-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051567#comment-17051567
]
DB Tsai commented on PARQUET-1809:
----------------------------------
cc [~rdblue]
In Spark, we don't run into issues reading or writing Parquet files that contain
`dots` in the column names. Can you elaborate on the issues you are referring
to? There are people using Parquet files with `dots` in column names in
production with Spark, and Spark has tests showing it works.
However, the one issue we do have is that Parquet's *filter2.predicate.FilterApi*
uses `dots` as separators between the parts of a nested column path, so in Spark
we intentionally disable predicate pushdown for any column name containing `dots`.
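To make the ambiguity concrete, here is a minimal Java sketch against the existing parquet-mr API (the class name and column names are mine for illustration): a predicate built from the string "a.b" always means nested field `b` inside group `a`, so a top-level column literally named "a.b" cannot be expressed.
{code:java}
import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.Operators.IntColumn;
import org.apache.parquet.hadoop.metadata.ColumnPath;

import java.util.Arrays;

public class DotAmbiguity {
  public static void main(String[] args) {
    // FilterApi splits the string on dots, so this column reference
    // always means the nested field b inside group a ...
    IntColumn nested = FilterApi.intColumn("a.b");
    System.out.println(Arrays.toString(nested.getColumnPath().toArray()));
    // -> [a, b]

    // ... which is exactly what fromDotString produces; there is no way
    // to spell a single top-level column whose name is literally "a.b".
    ColumnPath path = ColumnPath.fromDotString("a.b");
    System.out.println(Arrays.toString(path.toArray()));
    // -> [a, b]
  }
}
{code}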
I am adding the following code to the Spark codebase to support predicate
pushdown for columns containing `dots` in Parquet; the idea is to take
multi-part identifiers directly so that no separator parsing is needed.
[https://github.com/apache/spark/pull/27780/files#diff-3e66dccc7ac87df6d2767138dd65c2beR33]
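The building block already exists in parquet-common: `ColumnPath.get(String...)` takes the path elements directly and does no separator parsing. The sketch below shows that behavior, followed (in comments) by a hypothetical `FilterApi` overload in the spirit of the proposal; the varargs `intColumn` is my illustration, not an existing parquet-mr method.
{code:java}
import org.apache.parquet.hadoop.metadata.ColumnPath;

public class MultiPartPaths {
  public static void main(String[] args) {
    // ColumnPath.get takes the path elements directly; no string is
    // parsed, so a dot simply stays part of the field name.
    ColumnPath nested  = ColumnPath.get("a", "b"); // field b inside group a
    ColumnPath literal = ColumnPath.get("a.b");    // one field named "a.b"

    System.out.println(nested.toArray().length);   // 2
    System.out.println(literal.toArray().length);  // 1
  }
}

// A FilterApi overload in this spirit (hypothetical, and it would have
// to live inside the org.apache.parquet.filter2.predicate package to
// reach the IntColumn constructor) could be:
//
//   public static IntColumn intColumn(String... parts) {
//     return new IntColumn(ColumnPath.get(parts));
//   }
{code}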
Thanks,
> Add new APIs for nested predicate pushdown
> -------------------------------------------
>
> Key: PARQUET-1809
> URL: https://issues.apache.org/jira/browse/PARQUET-1809
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: DB Tsai
> Priority: Major
>
> Currently, Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses
> *dot* to split a column name into the multiple parts of a nested field. The
> drawback is that this causes issues when a field name itself contains a *dot*.
> The new APIs to be added will take an array of strings directly for the
> multiple parts of a nested field, so there is no confusion from using *dot*
> as a separator.
> See https://github.com/apache/spark/pull/27728 and [SPARK-17636] for details.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)