[ https://issues.apache.org/jira/browse/SEDONA-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608875#comment-17608875 ]
RJ Marcus commented on SEDONA-156:
----------------------------------
Jia, thank you for the response! Overall it sounds like it's headed in the
right direction.
Regarding "pushed filters" and "partition filters": I thought the _pushed_
filters are the ones that will be using the bbox during the scan to skip
files? I have extracted the bbox from the Parquet footer metadata to pass
into the pushedFilters family of functions; I have largely been ignoring
partition filters for now.
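In case it's useful, here is a minimal sketch of the bbox extraction from the footer ({{readBBox}} is just an illustrative helper, and json4s is assumed for the JSON parsing; any JSON library would do):
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// GeoParquet stores its metadata as JSON under the "geo" key of the
// Parquet footer; each geometry column carries a "bbox" of
// [minx, miny, maxx, maxy].
def readBBox(path: String, geometryColumn: String): Option[Seq[Double]] = {
  val inputFile = HadoopInputFile.fromPath(new Path(path), new Configuration())
  val reader = ParquetFileReader.open(inputFile)
  try {
    val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
    Option(kv.get("geo")).map { geoJson =>
      implicit val formats: Formats = DefaultFormats
      (parse(geoJson) \ "columns" \ geometryColumn \ "bbox").extract[Seq[Double]]
    }
  } finally {
    reader.close()
  }
}
{code}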
I took a look at the repositories on the neapowers website (itachi, bebe,
sql-alchemy), and I am not convinced they are applicable to this scenario,
because those Expressions are all data transformations (on existing SQL
datatypes) rather than predicate filters, so they never have to worry about
being pushed down in the first place. If they did have custom predicate
filters, I think they would run into the same problem unless they stuck to the
operators that already exist in SQL syntax ( <, >, =, OR, AND, NOT, >=, <=,
IS NULL, IN, CONTAINS, etc.; see
[Filters|https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala])
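To make that concrete, a quick test-style sketch against Spark's internal APIs shows the asymmetry:
{code:java}
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.execution.datasources.DataSourceStrategy

// A plain comparison maps onto a built-in operator, so it translates:
// Some(GreaterThan("col1", 5))
val ok = DataSourceStrategy.translateFilter('col1.int > 5, true)

// An ST_Within(...) Expression has no sources.Filter counterpart, so
// translateFilter returns None and the predicate is silently dropped
// from the pushed set.
{code}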
It's good that the ST_* functions are defined as Expressions. I _think_ we
should be able to coerce them to work in the DataSourceV2 API.
The problem I mentioned earlier is that even though the V2 API's +pushFilters+
function takes {{Seq[Expression]}}, [the function that actually pushes the
filters to the datasource is pushDataFilters|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala#L90-L95],
and that one takes {{Array[Filter]}}, which cannot be extended with new
definitions for our ST_* predicates. An ST_Within predicate makes it as far as
{{DataSourceStrategy.translateFilter}} and then fails because it can't be
translated into a {{Filter}}.
{code:java}
override def pushFilters(filters: Seq[Expression]): Seq[Expression] = {
  val (partitionFilters, dataFilters) =
    DataSourceUtils.getPartitionFiltersAndDataFilters(partitionSchema, filters)
  this.partitionFilters = partitionFilters
  this.dataFilters = dataFilters
  val translatedFilters = mutable.ArrayBuffer.empty[sources.Filter]
  for (filterExpr <- dataFilters) {
    val translated = DataSourceStrategy.translateFilter(filterExpr, true)
    if (translated.nonEmpty) {
      translatedFilters += translated.get
    }
  }
  pushedDataFilters = pushDataFilters(translatedFilters.toArray)
  dataFilters
}

override def pushedFilters: Array[Predicate] = pushedDataFilters.map(_.toV2)

/*
 * Push down data filters to the file source, so the data filters can be evaluated there to
 * reduce the size of the data to be read. By default, data filters are not pushed down.
 * File source needs to implement this method to push down data filters.
 */
protected def pushDataFilters(dataFilters: Array[Filter]): Array[Filter] = Array.empty[Filter]
{code}
So, I _think_ we can override +pushFilters+ and +pushedFilters+ so that they
still translate ordinary filters (e.g. {{col1 == 5}}) the way the current code
does, and simply ignore +pushDataFilters+, since it isn't used anywhere else.
Finally, we rewrite a handful of downstream functions in the existing
ParquetScanBuilder/ParquetFilters, which currently only deal with {{Filter}}.
I'm attempting to keep those rewrites as minimal as possible.
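To make that concrete, here is a rough sketch of what the overridden +pushFilters+ could look like. Names like {{isSpatialPredicate}} and {{pushedSpatialFilters}} are placeholders, not existing code, and this assumes a hypothetical GeoParquetScanBuilder extending the Parquet one:
{code:java}
override def pushFilters(filters: Seq[Expression]): Seq[Expression] = {
  val (partitionFilters, dataFilters) =
    DataSourceUtils.getPartitionFiltersAndDataFilters(partitionSchema, filters)
  this.partitionFilters = partitionFilters
  this.dataFilters = dataFilters

  // Set the ST_* predicates aside before translation so they are not
  // dropped; they can later be matched against each file's bbox to prune
  // the file list.
  val (spatialFilters, plainFilters) = dataFilters.partition(isSpatialPredicate)
  this.pushedSpatialFilters = spatialFilters

  // Everything else goes through the existing translation path unchanged.
  val translatedFilters = mutable.ArrayBuffer.empty[sources.Filter]
  for (filterExpr <- plainFilters) {
    val translated = DataSourceStrategy.translateFilter(filterExpr, true)
    if (translated.nonEmpty) {
      translatedFilters += translated.get
    }
  }
  pushedDataFilters = pushDataFilters(translatedFilters.toArray)

  // Returning all data filters means Spark still re-evaluates them after
  // the scan, so correctness doesn't depend on the bbox pruning being exact.
  dataFilters
}
{code}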
I'll update with more info after I've tried that.
> predicate pushdown support for GeoParquet
> -----------------------------------------
>
> Key: SEDONA-156
> URL: https://issues.apache.org/jira/browse/SEDONA-156
> Project: Apache Sedona
> Issue Type: New Feature
> Reporter: RJ Marcus
> Priority: Major
> Fix For: 1.3.0
>
>
> Support filter predicate pushdown for the new GeoParquet reader.