[
https://issues.apache.org/jira/browse/SEDONA-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608105#comment-17608105
]
Jia Yu commented on SEDONA-156:
-------------------------------
[~marcusrm] I really appreciate your effort on this. Just a heads-up, I don't
have a deep understanding about Spark DataSource V1 and V2 API. My two cents:
# We should use DataSource V2 API. The original contributor [~Ashar] does not
have a full understanding of V2, so she chose V1. I have no problem to change
it to V2.
# There are two things you might want to keep in mind
## Pushed filter: UDF function can be pushed down as a filter. Based on my
understanding, Sedona ST functions cannot be pushed down because UDFs in pure
Spark SQL are blackbox to Spark catalyst unless we do something with the
current Sedona ST functions.
## Partition filter: GeoParquet has this "BBox" statistics in each row store.
It can be used to filter out entire row store if the filter does not intersects
the BBox.
BulletPoint 2.1 "Pushed filter" is the thing you want. In fact it has nothing
to do with Parquet or GeoParquet. If you test Sedona ST functions using pure
SQL on CSV files, these filters not pushed. See below (using df.explain())
output:
PartitionFilters: [], PushedFilters: [],
However, Sedona implements all functions in Spark SQL Catalyst "Expressions"
[1] instead of the naive UDF. This gives you the possibility to push them down
to the data source (see [2]). There is an ongoing effort to enable Sedona ST
functions in type-safe format which bypasses the "udf.register" step (see [3])
So, with the current Sedona GeoParquet reader, and [3], it is possible that the
Pushed filter will be finally supported. You might want to check it out and
confirm my wild guess.
BulletPoint "Partition filter" is something even more difficult and it depends
on that if BulletPoint 2.1 "Pushed filter" can be solved.
[1]
https://github.com/apache/incubator-sedona/blob/master/sql/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Constructors.scala#L45
[2] [https://neapowers.com/apache-spark/native-functions-catalyst-expressions/]
[3] https://github.com/apache/incubator-sedona/pull/693
> predicate pushdown support for GeoParquet
> -----------------------------------------
>
> Key: SEDONA-156
> URL: https://issues.apache.org/jira/browse/SEDONA-156
> Project: Apache Sedona
> Issue Type: New Feature
> Reporter: RJ Marcus
> Priority: Major
> Fix For: 1.3.0
>
>
> Support for filter predicate for the new GeoParquet reader.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)