[ 
https://issues.apache.org/jira/browse/PARQUET-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051567#comment-17051567
 ] 

DB Tsai commented on PARQUET-1809:
----------------------------------

cc [~rdblue]

 

In Spark, we don't run into issues reading or writing Parquet files whose 
column names contain `dots`. Can you elaborate on the issues you are referring 
to? People use Parquet files with `dots` in column names in production with 
Spark, and Spark has tests showing this works.

However, the one issue we do have is that Parquet's *filter2.predicate.FilterApi* 
uses `dots` as separators, so in Spark we intentionally disable predicate 
pushdown for column names containing `dots`.
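To make the ambiguity concrete, here is a minimal, hypothetical illustration (plain Java, not parquet-mr code): a top-level column literally named "a.b" and a nested path a -> b produce the same dotted string, so splitting on the dot cannot distinguish them.

```java
public class DotAmbiguity {
    public static void main(String[] args) {
        // One top-level column whose name happens to contain a dot.
        String flat = "a.b";
        // A nested path: field b inside group a.
        String[] nested = {"a", "b"};
        // Joining the nested path with dots yields the same string,
        // so a dot-separated API cannot tell the two cases apart.
        String joined = String.join(".", nested);
        System.out.println(flat.equals(joined)); // prints true
    }
}
```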

 

I am adding the following code to the Spark codebase to support predicate 
pushdown for Parquet columns containing `dots`; the idea is to take multi-part 
identifiers directly so that no separator is needed.

[https://github.com/apache/spark/pull/27780/files#diff-3e66dccc7ac87df6d2767138dd65c2beR33]
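The shape of that change can be sketched roughly as follows (method and class names here are hypothetical illustrations, not the actual signatures in the linked PR): instead of a dotted string, the caller passes the path parts directly, so a dot in a name is never interpreted as a separator.

```java
public class ColumnPathSketch {
    // Old style: a dotted string is ambiguous -- intColumn("a.b") could
    // mean nested field b under group a, or one column named "a.b".

    // Sketched new style: take the path parts as a varargs array,
    // so no separator is ever parsed.
    public static String[] columnPath(String... parts) {
        return parts.clone();
    }

    public static void main(String[] args) {
        String[] nestedField = columnPath("a", "b"); // field b inside group a
        String[] dottedName  = columnPath("a.b");    // one column named "a.b"
        System.out.println(nestedField.length);      // prints 2
        System.out.println(dottedName.length);       // prints 1
    }
}
```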

 

Thanks,

 

>  Add new APIs for nested predicate pushdown
> -------------------------------------------
>
>                 Key: PARQUET-1809
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1809
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: DB Tsai
>            Priority: Major
>
> Currently, Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses 
> *dot* to split a column name into the parts of a nested field path. The 
> drawback is that this causes issues when a field name itself contains a *dot*.
> The new APIs to be added will take an array of strings directly for the 
> multi-part nested field path, so there is no ambiguity from using *dot* as a 
> separator.
> See https://github.com/apache/spark/pull/27728 and [SPARK-17636] for details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
