[
https://issues.apache.org/jira/browse/PARQUET-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051023#comment-17051023
]
Gabor Szadovszky commented on PARQUET-1809:
-------------------------------------------
I am afraid it is not only the filter API that is affected by this problem.
There are many places in the code that use the _dot string_ to address nested
columns.
If we allow dots in column names, it would break backward compatibility: old
readers would not be able to read new files.
Also, we address column names in configs (e.g. switching bloom filters on/off
for specific columns), so we have to deal with the conversion between strings
and column paths. If dots are no longer separator characters, what else shall
we use as a separator that we would not allow in column names? Or should we
implement an escaping mechanism for such characters?
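
To make the escaping option concrete, here is a minimal sketch in plain Java.
It is not existing parquet-mr code: the class DotPathEscaping, its two methods,
and the choice of backslash as the escape character are all assumptions made
for illustration. A literal dot or backslash inside a field name is prefixed
with a backslash, and only unescaped dots are treated as separators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of parquet-mr: one possible escaping scheme.
public final class DotPathEscaping {
    private DotPathEscaping() {}

    /** Joins path parts with '.', prefixing each literal '.' or '\' with '\'. */
    public static String toEscapedDotString(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.size(); i++) {
            if (i > 0) {
                sb.append('.'); // unescaped dot = separator between parts
            }
            for (char c : parts.get(i).toCharArray()) {
                if (c == '.' || c == '\\') {
                    sb.append('\\'); // escape characters that have a special meaning
                }
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Splits on unescaped '.' and removes the escape characters again. */
    public static List<String> fromEscapedDotString(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;
        for (char c : s.toCharArray()) {
            if (escaped) {
                current.append(c); // take the character literally
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '.') {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (escaped) {
            throw new IllegalArgumentException("Dangling escape in: " + s);
        }
        parts.add(current.toString());
        return parts;
    }
}
```

With this scheme the path ["a.b", "c"] would be written as a\.b.c in a config
value and parsed back to the same two parts. Note that this only addresses the
string-to-path conversion: a reader that does not know about the escaping would
still split a\.b.c on every dot, so the backward compatibility concern remains.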
> Add new APIs for nested predicate pushdown
> -------------------------------------------
>
> Key: PARQUET-1809
> URL: https://issues.apache.org/jira/browse/PARQUET-1809
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: DB Tsai
> Priority: Major
>
> Currently, Parquet's *org.apache.parquet.filter2.predicate.FilterApi*
> uses *dot* to split a column name into the parts of a nested field path. The
> drawback is that this causes issues when a field name itself contains a *dot*.
> The new APIs will take an array of strings directly for the parts of a
> nested field path, so there is no confusion from using *dot* as a separator.
> See https://github.com/apache/spark/pull/27728 and [SPARK-17636] for details.
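
The ambiguity described in the issue can be illustrated with a small sketch in
plain Java. The class NestedPathCollision and its methods are hypothetical and
only mimic a naive dot-string convention; they are not the FilterApi code. Two
different nested paths collapse to the same dot string, while their
array-of-parts form stays distinct.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not parquet-mr code.
public final class NestedPathCollision {
    private NestedPathCollision() {}

    /** Joins path parts with '.' without any escaping. */
    public static String naiveDotString(List<String> parts) {
        return String.join(".", parts);
    }

    /** Splits a dot string on every '.', keeping empty parts. */
    public static List<String> naiveSplit(String dotString) {
        return Arrays.asList(dotString.split("\\.", -1));
    }

    public static void main(String[] args) {
        List<String> dottedParent = Arrays.asList("a.b", "c"); // field "a.b" containing "c"
        List<String> dottedChild = Arrays.asList("a", "b.c");  // field "a" containing "b.c"

        // Both paths produce the same dot string, so the original structure is lost.
        System.out.println(naiveDotString(dottedParent));
        System.out.println(naiveDotString(dottedChild));

        // Splitting the string yields a third path that matches neither input.
        System.out.println(naiveSplit(naiveDotString(dottedParent)));

        // As lists of parts, the two paths remain distinguishable.
        System.out.println(dottedParent.equals(dottedChild));
    }
}
```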