[
https://issues.apache.org/jira/browse/SPARK-40955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-40955:
---------------------------------
Summary: Allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter
(was: allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter )
> Allow DSV2 Predicate pushdown in FileScanBuilder.pushedDataFilter
> ------------------------------------------------------------------
>
> Key: SPARK-40955
> URL: https://issues.apache.org/jira/browse/SPARK-40955
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output, SQL
> Affects Versions: 3.3.1
> Reporter: RJ Marcus
> Priority: Major
>
> +Overall:+
> Allow {{FileScanBuilder}} to push {{Predicate}} instead of {{Filter}} for data
> filters being pushed down to the source. This would allow new (arbitrary) DSV2
> {{Predicates}} to be pushed down to the file source.
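> To make the proposal concrete, here is a minimal, self-contained Scala sketch of why a sealed {{Filter}} hierarchy blocks third-party predicates while a generic {{Predicate}} node does not. All types below are simplified stand-ins for illustration, not Spark's actual classes.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's real classes.
// In Spark, org.apache.spark.sql.sources.Filter is sealed, so a third-party
// package cannot add new subclasses, while a V2 Predicate is just a named
// expression node that any source can choose to interpret.
sealed abstract class Filter                       // stand-in for sources.Filter
case class EqualTo(attr: String, value: Any) extends Filter

case class Expression(describe: String)            // stand-in for a V2 expression
class Predicate(val name: String, val children: Seq[Expression]) {
  override def toString: String =
    s"$name(${children.map(_.describe).mkString(", ")})"
}

// A spatial predicate can be expressed without modifying Spark at all:
val stContains = new Predicate("ST_CONTAINS",
  Seq(Expression("geometry"), Expression("POINT(1 2)")))
```

> The point of the sketch: with a sealed hierarchy, only the operators Spark ships can exist; with a named-node representation, the set of predicates is open-ended and the source decides which names it understands.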
> Hello Spark developers,
> Thank you in advance for reading. Please excuse me if I make mistakes; this
> is my first time working on apache/spark internals. I am asking these
> questions to better understand whether my proposed changes fall within the
> intended scope of Data Source V2 API functionality.
> +Motivation / Background:+
> I am working on a branch in
> [apache/incubator-sedona|https://github.com/apache/incubator-sedona] to
> extend its support of GeoParquet files to include predicate pushdown of
> PostGIS-style spatial predicates (e.g. {{ST_Contains()}}) that can take
> advantage of spatial info in file metadata. We would like to inherit as much
> as possible from the Parquet classes (because GeoParquet basically just adds
> a binary geometry column). However, {{FileScanBuilder.scala}} appears to be
> missing some functionality I need for DSV2 {{Predicates}}.
> +My understanding of the problem so far:+
> The {{ST_*}} {{Expression}} must be detected as a pushable predicate
> (ParquetScanBuilder.scala:71) and passed as a {{pushedDataFilter}} to the
> {{ParquetPartitionReaderFactory}}, where it will be translated into a
> (user-defined) {{FilterPredicate}}.
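> As an illustration of the translation step described above, here is a hypothetical, self-contained sketch of turning a pushed predicate into a user-defined filter-predicate type. Every name in it is an invented stand-in, not actual Spark or Sedona API.

```scala
// Hypothetical sketch: a pushed predicate arriving at the reader factory is
// converted into a user-defined FilterPredicate. All names are stand-ins.
case class Predicate(name: String, children: Seq[String])

sealed trait FilterPredicate
case class SpatialContains(column: String, wkt: String) extends FilterPredicate
case object NoFilter extends FilterPredicate       // untranslatable: push nothing

def translate(p: Predicate): FilterPredicate = p match {
  case Predicate("ST_CONTAINS", Seq(col, geom)) => SpatialContains(col, geom)
  case _                                        => NoFilter
}
```

> Note the fallback: a predicate the source cannot translate is simply not applied at scan time, which is safe because Spark re-evaluates filters on the returned rows anyway.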
> The [Filter class is
> sealed|https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala],
> so the Sedona package can't define new {{Filters}}; the DSV2 {{Predicate}}
> appears to be the preferred mechanism for accomplishing this task (referred to
> as "V2 Filter" in SPARK-39966). However, {{pushedDataFilters}} in
> FileScanBuilder.scala is of type {{sources.Filter}}.
> Some recent work (SPARK-39139) added the ability to detect user-defined
> functions in {{DataSourceV2Strategy.translateFilterV2() >
> V2ExpressionBuilder.generateExpression()}}, which I think could accomplish
> the detection correctly if {{FileScanBuilder}} called
> {{DataSourceV2Strategy.translateFilterV2()}} instead of
> {{DataSourceStrategy.translateFilter()}}.
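> To illustrate the contrast between the two translation paths, here is a simplified, self-contained sketch. In Spark the real methods return {{Option[sources.Filter]}} and {{Option[Predicate]}} respectively; every type below is an invented stand-in.

```scala
// Stand-in contrast of the two translation paths. The V1 path can only emit a
// fixed, sealed set of operators; the V2 path can carry any named predicate.
case class CatalystExpr(op: String, args: Seq[String])

sealed trait V1Filter
case class V1EqualTo(attr: String, value: String) extends V1Filter

case class V2Predicate(name: String, children: Seq[String])

// V1 path: only operators with a Filter counterpart survive translation.
def translateFilter(e: CatalystExpr): Option[V1Filter] = e.op match {
  case "=" => Some(V1EqualTo(e.args(0), e.args(1)))
  case _   => None   // e.g. ST_Contains has no sources.Filter form: not pushed
}

// V2 path: any named predicate can be carried through to the source.
def translateFilterV2(e: CatalystExpr): Option[V2Predicate] =
  Some(V2Predicate(e.op.toUpperCase, e.args))
```

> Under these (simplified) assumptions, a spatial operator falls through the V1 path as {{None}} but survives the V2 path as a named node the file source could interpret.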
> However, changing {{FileScanBuilder}} to use {{Predicate}} instead of
> {{Filter}} would require many changes to all file-based data sources. I don't
> want to spend effort making sweeping changes if the current behavior of Spark
> is intentional.
>
> +Concluding Questions:+
> Should {{FileScanBuilder}} push {{Predicate}} instead of {{Filter}} for
> data filters being pushed down to the source? Or should this live in a new
> {{FileScanBuilderV2}}?
> If not, how can a developer of a data source push down a new (or user
> defined) predicate to the file source?
> Thank you again for reading. Pending feedback, I will start working on a PR
> for this functionality.
> [~beliefer] [~cloud_fan] [~huaxingao] have worked on DSV2-related Spark
> issues and I welcome your input. Please ignore this if I "@" you incorrectly.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)