[
https://issues.apache.org/jira/browse/SEDONA-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607857#comment-17607857
]
RJ Marcus commented on SEDONA-156:
----------------------------------
Hello [~jiayu], I have spent a while working on SEDONA-156 and I think my
approach might be misguided. Maybe you can tell me whether I am overlooking
something important.
Background: the current code in Sedona uses the DataSource V1 API. If we
want to use the V2 API, we need classes in
{{org/apache/spark/sql/execution/datasources/v2/parquet/}}, e.g.
{{GeoParquetDataSourceV2.scala}}. Otherwise it defaults to loading as a V1
source via {{GeoParquetFileFormat.scala}}.
----
V1 of the DataSource API pushes down predicates of type {{sources.Filter}},
which is a sealed class, so I can't add definitions for new filter types. I
thought we probably want the {{Expression}}s in {{Predicates.scala}}
({{ST_Within()}}, {{ST_Intersects()}}, etc.) to be recognized as the
predicate and pushed down, e.g.:
{code:java}
sparkSession.read.format("geoparquet").load(path)
  .filter("ST_Within(geometry, ST_PolygonFromEnvelope(-1, -1, 1, 1))")
{code}
Otherwise we could "hijack" the existing {{sources.Filter}} syntax, e.g.:
{code:java}
sparkSession.read.format("geoparquet").load(path)
  .filter("point_df.geometry IN (ST_PolygonFromEnvelope(-1, -1, 1, 1))")
{code}
Everywhere else in the V1 API ({{GeoParquetFileFormat.scala}},
{{GeoParquetFilters.scala}}) it uses {{sources.Filter}}.
So the first question is: am I missing something here? It seems like a fairly
big hurdle that the {{sources.Filter}} class is sealed (this even causes
problems in the DataSourceV2 API). Would we want to "hijack" the existing
{{sources.Filter}} syntax instead of using {{ST_*()}} syntax?
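To make the sealed-class hurdle concrete, here is a toy Java model of the problem (simplified, hypothetical names; the real {{org.apache.spark.sql.sources.Filter}} is a sealed Scala class with many more cases):

```java
// Toy model of a sealed filter hierarchy (NOT the real sources.Filter).
// A sealed type permits only the subclasses it names in the same file, so a
// third-party module like Sedona cannot add a spatial case such as StWithin.
public class SealedFilterDemo {
    sealed interface Filter permits EqualTo, GreaterThan {}
    record EqualTo(String attribute, Object value) implements Filter {}
    record GreaterThan(String attribute, Object value) implements Filter {}

    // Every consumer can enumerate the closed set of subclasses -- the same
    // property that blocks extension from outside this file.
    static String describe(Filter f) {
        if (f instanceof EqualTo e) return e.attribute() + " = " + e.value();
        if (f instanceof GreaterThan g) return g.attribute() + " > " + g.value();
        throw new IllegalArgumentException("unknown filter");
    }

    public static void main(String[] args) {
        System.out.println(describe(new EqualTo("id", 1)));
        System.out.println(describe(new GreaterThan("x", 0)));
    }
}
```

Attempting `record StWithin(...) implements Filter {}` from another file fails to compile, which is exactly the wall I keep hitting.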
After SPARK-36351 was merged (Spark 3.3+), {{FileScanBuilder.pushFilters}}
takes {{Expression}} instead of {{Filter}}, via the new
{{SupportsPushDownCatalystFilters}} interface. Based on the discussion of that
PR, the change was meant mostly for internal bookkeeping of data filters
vs. partition filters, but I think we could use it to pass our data-filter
{{Expression}}s into a {{Predicate}} in {{GeoParquetScanBuilder}} and then
into a user-defined {{FilterPredicate}} in {{GeoParquetFilters}}.
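The bookkeeping I have in mind could be sketched roughly like this (plain Java with hypothetical names like {{SpatialPredicate}} and {{pushFilters}} -- not the actual Spark or Sedona interfaces): the scan builder records the spatial predicates it can evaluate against GeoParquet metadata and hands everything else back for Spark to re-evaluate after the scan.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting pushed vs. post-scan filters, in the
// spirit of SupportsPushDownCatalystFilters. Illustrative names only.
public class PushdownSketch {
    interface Expr {}
    // A spatial predicate the reader could check against row-group metadata.
    record SpatialPredicate(String column, double minX, double minY,
                            double maxX, double maxY) implements Expr {}
    // Anything else is left for Spark to evaluate after the scan.
    record OtherFilter(String sql) implements Expr {}

    static final List<Expr> pushed = new ArrayList<>();

    // Returns the filters that could NOT be pushed (Spark re-applies them);
    // pushed spatial predicates are recorded for the reader to use.
    static List<Expr> pushFilters(List<Expr> exprs) {
        List<Expr> postScan = new ArrayList<>();
        for (Expr e : exprs) {
            if (e instanceof SpatialPredicate) pushed.add(e);
            else postScan.add(e);
        }
        return postScan;
    }

    public static void main(String[] args) {
        List<Expr> post = pushFilters(List.of(
            new SpatialPredicate("geometry", -1, -1, 1, 1),
            new OtherFilter("id > 5")));
        System.out.println("pushed=" + pushed + " postScan=" + post);
    }
}
```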
This requires geo* copies of the Parquet versions of the files in
{{org/apache/spark/sql/execution/datasources/v2/parquet/}} for the V2 API,
and overriding several methods to deal with {{Predicate}}s instead of
{{sources.Filter}}. This is what I'm working on right now, and I think I'm
pretty close, but I worry the solution may be too complicated and that I am
overlooking something.
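For what it's worth, the payoff at the bottom of that chain would be a bounding-box test against row-group statistics; a toy version of the check (hypothetical, much simpler than Parquet's actual {{FilterPredicate}} API):

```java
// Toy bounding-box pruning: a row group whose geometry bbox statistics do
// not intersect the query envelope can be skipped entirely. This is the
// kind of test a spatial FilterPredicate would ultimately perform.
public class BBoxPrune {
    record Envelope(double minX, double minY, double maxX, double maxY) {
        boolean intersects(Envelope o) {
            return minX <= o.maxX && o.minX <= maxX
                && minY <= o.maxY && o.minY <= maxY;
        }
    }

    // Skip the row group only when its bbox cannot contain matching rows.
    static boolean canSkipRowGroup(Envelope rowGroupBBox, Envelope query) {
        return !rowGroupBBox.intersects(query);
    }

    public static void main(String[] args) {
        Envelope query = new Envelope(-1, -1, 1, 1);
        System.out.println(canSkipRowGroup(new Envelope(5, 5, 9, 9), query)); // true
        System.out.println(canSkipRowGroup(new Envelope(0, 0, 2, 2), query)); // false
    }
}
```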
Second question: does this DataSourceV2 approach sound promising, or totally
off track?
Thank you for taking the time to read. I am still learning the codebase, but I
have a much better understanding than I did two weeks ago. Any feedback is
appreciated.
> predicate pushdown support for GeoParquet
> -----------------------------------------
>
> Key: SEDONA-156
> URL: https://issues.apache.org/jira/browse/SEDONA-156
> Project: Apache Sedona
> Issue Type: New Feature
> Reporter: RJ Marcus
> Priority: Major
> Fix For: 1.3.0
>
>
> Support for filter predicate for the new GeoParquet reader.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)