[
https://issues.apache.org/jira/browse/SEDONA-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607857#comment-17607857
]
RJ Marcus commented on SEDONA-156:
----------------------------------
Hello [~jiayu], I have spent a while working on SEDONA-156 and I think my
approach might be misguided. Maybe you can tell me whether I am overlooking
something important.
Background: the current code in Sedona uses the DataSource V1 API. If we
want to use the V2 API, we need classes in
{{org/apache/spark/sql/execution/datasources/v2/parquet/}}, e.g.
{{GeoParquetDataSourceV2.scala}}. Otherwise it defaults to loading as a V1
source via {{GeoParquetFileFormat.scala}}.
----
V1 of the DataSource API pushes down predicates of type {{sources.Filter}},
which is a sealed class, so I can't add definitions for new filter types. I
thought we probably want the {{Expression}}s in {{Predicates.scala}}
({{ST_Within()}}, {{ST_Intersects()}}, etc.) to be recognized as the
predicate and pushed down, e.g.:
{code:java}
sparkSession.read.format("geoparquet").load(path)
  .filter("ST_Within(geometry, ST_PolygonFromEnvelope(-1, -1, 1, 1))")
{code}
Otherwise we could "hijack" the existing {{sources.Filter}} syntax, e.g.:
{code:java}
sparkSession.read.format("geoparquet").load(path)
  .filter("point_df.geometry IN (ST_PolygonFromEnvelope(-1, -1, 1, 1))")
{code}
Everywhere else in the V1 API ({{GeoParquetFileFormat.scala}},
{{GeoParquetFilters.scala}}) it uses {{sources.Filter}}.
So the first question is: am I missing something here? It seems like a fairly
big hurdle that the {{sources.Filter}} class is sealed (this even causes
problems in the DataSourceV2 API). Would we want to "hijack" the existing
{{sources.Filter}} syntax instead of using {{ST_*()}} syntax?
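To make the sealed-class hurdle concrete, here is a toy Java model of the problem (simplified, hypothetical names; the real {{org.apache.spark.sql.sources.Filter}} is a sealed Scala class with many more cases):

```java
// Toy model of a sealed filter hierarchy (NOT the real sources.Filter).
// A sealed type permits only the subclasses it names in the same file, so a
// third-party module like Sedona cannot add a spatial case such as StWithin.
public class SealedFilterDemo {
    sealed interface Filter permits EqualTo, GreaterThan {}
    record EqualTo(String attribute, Object value) implements Filter {}
    record GreaterThan(String attribute, Object value) implements Filter {}

    // Every consumer can enumerate the closed set of subclasses -- the same
    // property that blocks extension from outside this file.
    static String describe(Filter f) {
        if (f instanceof EqualTo e) return e.attribute() + " = " + e.value();
        if (f instanceof GreaterThan g) return g.attribute() + " > " + g.value();
        throw new IllegalArgumentException("unknown filter");
    }

    public static void main(String[] args) {
        System.out.println(describe(new EqualTo("id", 1)));
        System.out.println(describe(new GreaterThan("x", 0)));
    }
}
```

Attempting `record StWithin(...) implements Filter {}` from another file fails to compile, which is exactly the wall I keep hitting.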
After SPARK-36351 was merged (Spark 3.3+), {{FileScanBuilder.pushFilters}}
takes {{Expression}} instead of {{Filter}}, via the new
{{SupportsPushDownCatalystFilters}} interface. Based on the discussion of that
PR, the change was meant mostly for internal bookkeeping of data filters
vs. partition filters, but I think we could use it to pass our data-filter
{{Expression}}s into a {{Predicate}} in {{GeoParquetScanBuilder}} and then
into a user-defined {{FilterPredicate}} in {{GeoParquetFilters}}.
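The bookkeeping I have in mind could be sketched roughly like this (plain Java with hypothetical names like {{SpatialPredicate}} and {{pushFilters}} -- not the actual Spark or Sedona interfaces): the scan builder records the spatial predicates it can evaluate against GeoParquet metadata and hands everything else back for Spark to re-evaluate after the scan.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting pushed vs. post-scan filters, in the
// spirit of SupportsPushDownCatalystFilters. Illustrative names only.
public class PushdownSketch {
    interface Expr {}
    // A spatial predicate the reader could check against row-group metadata.
    record SpatialPredicate(String column, double minX, double minY,
                            double maxX, double maxY) implements Expr {}
    // Anything else is left for Spark to evaluate after the scan.
    record OtherFilter(String sql) implements Expr {}

    static final List<Expr> pushed = new ArrayList<>();

    // Returns the filters that could NOT be pushed (Spark re-applies them);
    // pushed spatial predicates are recorded for the reader to use.
    static List<Expr> pushFilters(List<Expr> exprs) {
        List<Expr> postScan = new ArrayList<>();
        for (Expr e : exprs) {
            if (e instanceof SpatialPredicate) pushed.add(e);
            else postScan.add(e);
        }
        return postScan;
    }

    public static void main(String[] args) {
        List<Expr> post = pushFilters(List.of(
            new SpatialPredicate("geometry", -1, -1, 1, 1),
            new OtherFilter("id > 5")));
        System.out.println("pushed=" + pushed + " postScan=" + post);
    }
}
```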
This requires geo* copies of the Parquet versions of the files in
{{org/apache/spark/sql/execution/datasources/v2/parquet/}} for the V2 API,
and overriding several methods to deal with {{Predicate}}s instead of
{{sources.Filter}}. This is what I'm working on right now, and I think I'm
pretty close, but I worry the solution may be too complicated and that I am
overlooking something.
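For what it's worth, the payoff at the bottom of that chain would be a bounding-box test against row-group statistics; a toy version of the check (hypothetical, much simpler than Parquet's actual {{FilterPredicate}} API):

```java
// Toy bounding-box pruning: a row group whose geometry bbox statistics do
// not intersect the query envelope can be skipped entirely. This is the
// kind of test a spatial FilterPredicate would ultimately perform.
public class BBoxPrune {
    record Envelope(double minX, double minY, double maxX, double maxY) {
        boolean intersects(Envelope o) {
            return minX <= o.maxX && o.minX <= maxX
                && minY <= o.maxY && o.minY <= maxY;
        }
    }

    // Skip the row group only when its bbox cannot contain matching rows.
    static boolean canSkipRowGroup(Envelope rowGroupBBox, Envelope query) {
        return !rowGroupBBox.intersects(query);
    }

    public static void main(String[] args) {
        Envelope query = new Envelope(-1, -1, 1, 1);
        System.out.println(canSkipRowGroup(new Envelope(5, 5, 9, 9), query)); // true
        System.out.println(canSkipRowGroup(new Envelope(0, 0, 2, 2), query)); // false
    }
}
```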
Second question: does this DataSourceV2 approach sound promising, or totally
off track?
Thank you for taking the time to read. I am still learning the codebase, but I
have a much better understanding than I did two weeks ago. Any feedback is
appreciated.
> predicate pushdown support for GeoParquet
> -----------------------------------------
>
> Key: SEDONA-156
> URL: https://issues.apache.org/jira/browse/SEDONA-156
> Project: Apache Sedona
> Issue Type: New Feature
> Reporter: RJ Marcus
> Priority: Major
> Fix For: 1.3.0
>
>
> Support for filter predicate for the new GeoParquet reader.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)