[jira] [Commented] (PARQUET-1809) Add new APIs for nested predicate pushdown

Gabor Szadovszky (Jira) Wed, 04 Mar 2020 01:15:24 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051023#comment-17051023
 ]


Gabor Szadovszky commented on PARQUET-1809:
-------------------------------------------

I am afraid it is not only the filter API that is affected by this problem.
There are many places in the code that use the _dot string_ to address nested
columns.
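
To make the ambiguity concrete, here is a minimal sketch in plain Java. It does
not use any Parquet classes; `splitPath` is a hypothetical stand-in for whatever
code converts a dot string into a column path. It shows that a nested column
`a` -> `b` and a top-level column literally named `a.b` produce the same dot
string, so the original path cannot be recovered from it.

```java
import java.util.Arrays;
import java.util.List;

public class DotPathAmbiguity {
    // Hypothetical helper (not a Parquet API): split a dot string into path segments.
    static List<String> splitPath(String dotString) {
        return Arrays.asList(dotString.split("\\."));
    }

    public static void main(String[] args) {
        // Nested column: group "a" containing field "b".
        List<String> nested = Arrays.asList("a", "b");
        // Top-level column whose name contains a dot.
        List<String> flat = Arrays.asList("a.b");

        String nestedDot = String.join(".", nested);
        String flatDot = String.join(".", flat);

        // Both paths render as the same dot string "a.b".
        System.out.println(nestedDot.equals(flatDot)); // prints true

        // Parsing the dot string always yields the nested interpretation.
        System.out.println(splitPath(flatDot)); // prints [a, b]
    }
}
```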

If we allow dots in column names, it would break backward compatibility: old
readers will not be able to read new files.

Also, we address columns by name in configs (e.g. switching bloom filters
on/off for specific columns), so we have to deal with the conversion between
strings and column paths. If dots are no longer separator characters, what
else shall we use that we would not allow in column names? Or should we
implement an escaping mechanism for such characters?
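
As a rough illustration of what such an escaping mechanism could look like,
the sketch below keeps the dot as the separator and uses a backslash to escape
a literal dot or backslash inside a segment. This is only one possible
convention, assumed here for the example; the class and the `joinPath` /
`splitPath` helpers are hypothetical and are not part of parquet-mr. It would
only cover the string-to-path conversion in configs; old readers would still
not understand files written with such names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EscapedColumnPath {
    // Join segments with '.', prefixing any '.' or '\' inside a segment with '\'.
    static String joinPath(List<String> segments) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < segments.size(); i++) {
            if (i > 0) {
                sb.append('.');
            }
            for (char c : segments.get(i).toCharArray()) {
                if (c == '.' || c == '\\') {
                    sb.append('\\');
                }
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Split on unescaped '.', turning "\." and "\\" back into '.' and '\'.
    static List<String> splitPath(String path) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '\\' && i + 1 < path.length()) {
                // Escaped character: take the next character literally.
                current.append(path.charAt(++i));
            } else if (c == '.') {
                // Unescaped dot: end of the current segment.
                segments.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        segments.add(current.toString());
        return segments;
    }

    public static void main(String[] args) {
        List<String> path = Arrays.asList("a.b", "c");
        String encoded = joinPath(path);
        System.out.println(encoded);            // prints a\.b.c
        System.out.println(splitPath(encoded)); // prints [a.b, c]
    }
}
```

With this convention, existing config values without backslashes parse exactly
as before, which is the main reason to prefer escaping over picking a new
separator character. Edge cases such as empty paths are not handled here.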

>  Add new APIs for nested predicate pushdown
> -------------------------------------------
>
>                 Key: PARQUET-1809
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1809
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: DB Tsai
>            Priority: Major
>
> Currently, Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses
> *dot* to split the column name into the parts of a nested field path. The
> drawback is that this causes issues when a field name contains a *dot*.
> The new APIs will take an array of strings directly for the parts of the
> nested field path, so there is no confusion from using *dot* as a separator.
> See https://github.com/apache/spark/pull/27728 and [SPARK-17636] for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
