[jira] [Commented] (SEDONA-156) predicate pushdown support for GeoParquet

Jia Yu (Jira) Wed, 21 Sep 2022 23:48:06 -0700


    [ 
https://issues.apache.org/jira/browse/SEDONA-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608105#comment-17608105
 ]


Jia Yu commented on SEDONA-156:
-------------------------------

[~marcusrm] I really appreciate your effort on this. Just a heads-up: I don't 
have a deep understanding of the Spark DataSource V1 and V2 APIs. My two cents:

 
 # We should use the DataSource V2 API. The original contributor [~Ashar] did 
not have a full understanding of V2, so she chose V1. I have no problem with 
changing it to V2.
 # There are two things you might want to keep in mind:
 ## Pushed filter: a function used as a filter can, in principle, be pushed 
down to the data source. Based on my understanding, Sedona ST functions 
currently cannot be pushed down, because UDFs in pure Spark SQL are a black box 
to Spark Catalyst unless we do something with the current Sedona ST functions.
 ## Partition filter: GeoParquet has "BBox" statistics in each row store. They 
can be used to filter out an entire row store if the filter does not intersect 
the BBox.
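
To make the "Partition filter" idea concrete, here is a minimal sketch in plain Python. It is not Sedona or Parquet code: the RowStore class, the (min_x, min_y, max_x, max_y) layout, and the sample values are all assumptions made up for illustration. It only shows the pruning logic, that is, skipping every chunk whose BBox does not intersect the query window.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed layout for this sketch: (min_x, min_y, max_x, max_y).
BBox = Tuple[float, float, float, float]


@dataclass
class RowStore:
    """Hypothetical stand-in for one chunk of rows plus its BBox statistics."""
    name: str
    bbox: BBox


def intersects(a: BBox, b: BBox) -> bool:
    """Return True when two axis-aligned boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def prune(stores: List[RowStore], query: BBox) -> List[RowStore]:
    """Keep only the chunks whose BBox could contain matching rows."""
    return [s for s in stores if intersects(s.bbox, query)]


# Made-up sample data: three chunks and one query window.
stores = [
    RowStore("a", (0.0, 0.0, 10.0, 10.0)),
    RowStore("b", (20.0, 20.0, 30.0, 30.0)),
    RowStore("c", (5.0, 5.0, 25.0, 25.0)),
]
query_window = (8.0, 8.0, 12.0, 12.0)

kept = [s.name for s in prune(stores, query_window)]
print(kept)
```

Chunks that survive the pruning step still need the exact spatial predicate evaluated row by row; the BBox test only rules out chunks that cannot match.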

 

Bullet point 2.1, "Pushed filter", is the thing you want. In fact, it has 
nothing to do with Parquet or GeoParquet. If you test Sedona ST functions using 
pure SQL on CSV files, these filters are not pushed down either. See the 
df.explain() output below:

PartitionFilters: [], PushedFilters: [],

 

However, Sedona implements all functions as Spark SQL Catalyst "Expressions" 
[1] instead of naive UDFs. This makes it possible to push them down to the 
data source (see [2]). There is an ongoing effort to expose Sedona ST functions 
in a type-safe format that bypasses the "udf.register" step (see [3]).
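
As a rough illustration of why a structured expression can be pushed down while a black-box UDF cannot, here is a toy model in plain Python. It is not Spark's Catalyst API: GreaterThan, to_source_filter, and evaluate are hypothetical names invented for this sketch. The point is only that a planner can read the fields of a structured expression and translate them into a data source filter, whereas an opaque callable gives it nothing to inspect.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple, Union


@dataclass(frozen=True)
class GreaterThan:
    """Toy structured expression meaning `column > value`."""
    column: str
    value: float


Row = Dict[str, Any]
# A predicate is either a structured expression or an opaque callable.
Predicate = Union[GreaterThan, Callable[[Row], bool]]


def to_source_filter(pred: Predicate) -> Optional[Tuple[str, str, float]]:
    """Translate a predicate into (column, op, value), or None if it is opaque."""
    if isinstance(pred, GreaterThan):
        return (pred.column, ">", pred.value)
    return None  # nothing to inspect, so it cannot be pushed down


def evaluate(pred: Predicate, row: Row) -> bool:
    """Evaluate either kind of predicate against a single row."""
    if isinstance(pred, GreaterThan):
        return row[pred.column] > pred.value
    return pred(row)


structured = GreaterThan("x", 5.0)
opaque = lambda row: row["x"] > 5.0  # same logic, hidden inside a callable

pushed_structured = to_source_filter(structured)
pushed_opaque = to_source_filter(opaque)
print(pushed_structured, pushed_opaque)
```

Both predicates give the same answer on any row; only the structured one can be handed to the reader ahead of time.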

So, with the current Sedona GeoParquet reader and [3], it is possible that the 
pushed filter will eventually be supported. You might want to check it out and 
confirm my wild guess.

 

Bullet point 2.2, "Partition filter", is even more difficult, and it depends 
on whether bullet point 2.1, "Pushed filter", can be solved.

 

[1] 
https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Constructors.scala#L45

[2] https://neapowers.com/apache-spark/native-functions-catalyst-expressions/

[3] https://github.com/apache/incubator-sedona/pull/693

 

> predicate pushdown support for GeoParquet
> -----------------------------------------
>
>                 Key: SEDONA-156
>                 URL: https://issues.apache.org/jira/browse/SEDONA-156
>             Project: Apache Sedona
>          Issue Type: New Feature
>            Reporter: RJ Marcus
>            Priority: Major
>             Fix For: 1.3.0
>
>
> Support for filter predicate for the new GeoParquet reader. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
