schenksj opened a new pull request, #54669:
URL: https://github.com/apache/spark/pull/54669
# [SPARK-55869][SQL] Extended Predicate Pushdown for DataSource V2
## What changes were proposed in this pull request?
This PR extends Spark's DataSource V2 predicate pushdown framework with
three layers of new functionality, all gated behind a single config switch
(`spark.sql.dataSource.extendedPredicatePushdown.enabled`, default `true`).
### Layer 1: Capability-Gated Builtin Predicate Translation
Data sources can now opt in to receiving additional builtin predicates by
implementing `SupportsPushDownPredicateCapabilities` on their `ScanBuilder`.
The interface declares a set of predicate names the source can handle:
- `LIKE` — full pattern matching (`expr1 LIKE expr2`)
- `RLIKE` — regex matching (`expr1 RLIKE expr2`)
- `ILIKE` — case-insensitive LIKE (`expr1 ILIKE expr2`)
- `IS_NAN` — NaN check (`isnan(expr)`)
- `ARRAY_CONTAINS` — array element check (`array_contains(expr1, expr2)`)
- `MAP_CONTAINS_KEY` — map key check (`map_contains_key(expr1, expr2)`)
`V2ExpressionBuilder` consults the declared capabilities before translating
these expressions, so sources that do not declare support keep the existing
behavior unchanged.
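To make the capability gate concrete, here is a minimal, self-contained sketch in plain Java. The interface and class bodies are stand-ins modeled on the description above, not the actual Spark classes, and the method name `supportedPredicateNames` is an assumption for illustration:

```java
import java.util.Set;

// Stand-in for Spark's ScanBuilder (illustrative only).
interface ScanBuilder {}

// Stand-in for the new capability interface described in Layer 1.
interface SupportsPushDownPredicateCapabilities extends ScanBuilder {
  // Predicate names (e.g. "LIKE", "IS_NAN") this source can evaluate.
  Set<String> supportedPredicateNames();
}

// A data source opting in to two of the new builtin predicates.
class MyScanBuilder implements SupportsPushDownPredicateCapabilities {
  @Override
  public Set<String> supportedPredicateNames() {
    return Set.of("LIKE", "ARRAY_CONTAINS");
  }
}

public class CapabilityCheckDemo {
  // Mirrors the gate V2ExpressionBuilder applies before translating a
  // builtin predicate for a given source.
  static boolean canTranslate(ScanBuilder builder, String predicateName) {
    return builder instanceof SupportsPushDownPredicateCapabilities caps
        && caps.supportedPredicateNames().contains(predicateName);
  }

  public static void main(String[] args) {
    ScanBuilder builder = new MyScanBuilder();
    System.out.println(canTranslate(builder, "LIKE"));   // true
    System.out.println(canTranslate(builder, "RLIKE"));  // false
  }
}
```

Sources that do not implement the interface fail the `instanceof` check and fall through to the pre-existing translation path.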
### Layer 2: Custom Predicate Functions via `SupportsCustomPredicates`
Tables can declare custom predicate functions by implementing
`SupportsCustomPredicates`, which returns an array of
`CustomPredicateDescriptor` objects. Each descriptor specifies:
- `canonicalName()` — dot-qualified name (e.g. `com.mycompany.MY_SEARCH`)
used in the V2 `Predicate`
- `sqlName()` — the unqualified name users write in SQL (e.g. `my_search`)
- `parameterTypes()` — optional expected parameter types (enables automatic
casting)
- `isDeterministic()` — whether the predicate is deterministic
Users write standard SQL function-call syntax: `SELECT * FROM t WHERE
my_search(col, 'param')`. The analyzer resolves these against the table's
descriptors and produces `CustomPredicateExpression` nodes, which
`V2ExpressionBuilder` translates into V2 `Predicate` objects using the
dot-qualified canonical name.
A post-optimizer rule (`EnsureCustomPredicatesPushed`) fails the query if
any custom predicate remains unpushed, since Spark cannot evaluate them locally.
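The resolution step can be sketched as follows. `PredicateDescriptor` is a hypothetical stand-in for `CustomPredicateDescriptor`; its four components follow the PR text above, but the exact signatures and the `resolve` helper are assumptions for illustration:

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for CustomPredicateDescriptor (Layer 2).
record PredicateDescriptor(
    String canonicalName,    // dot-qualified name used in the V2 Predicate
    String sqlName,          // unqualified name users write in SQL
    String[] parameterTypes, // optional expected types; enables auto-cast
    boolean isDeterministic) {}

public class CustomPredicateDemo {
  // Mirrors what the analyzer does: match the SQL function name against
  // the table's declared descriptors and hand back the canonical name
  // used when building the V2 Predicate.
  static Optional<String> resolve(Map<String, PredicateDescriptor> byName,
                                  String sqlFunctionName) {
    PredicateDescriptor d = byName.get(sqlFunctionName.toLowerCase());
    return Optional.ofNullable(d).map(PredicateDescriptor::canonicalName);
  }

  public static void main(String[] args) {
    PredicateDescriptor search = new PredicateDescriptor(
        "com.mycompany.MY_SEARCH", "my_search",
        new String[] {"string", "string"}, true);
    Map<String, PredicateDescriptor> byName = Map.of(search.sqlName(), search);

    // SELECT * FROM t WHERE my_search(col, 'param')
    System.out.println(resolve(byName, "my_search").orElse("<unresolved>"));
    System.out.println(resolve(byName, "other_fn").orElse("<unresolved>"));
  }
}
```

An unresolved name falls back to normal function lookup; a resolved one becomes a `CustomPredicateExpression` that must ultimately be pushed to the scan, or `EnsureCustomPredicatesPushed` fails the query.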
### Layer 3: Custom Infix Operator Syntax via Parser Extensions
For data sources that want infix operator syntax (e.g. `col INDEXQUERY
'param'`), an abstract `CustomOperatorParserExtension` base class is provided.
It rewrites infix expressions to function calls before the standard parser runs:
```
col INDEXQUERY 'param' → INDEXQUERY(col, 'param')
```
Data source authors extend `CustomOperatorParserExtension`, implement
`customOperators`, and register via `SparkSessionExtensions.injectParser`.
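The pre-parse rewrite can be illustrated with a regex-based sketch. This is not the PR's implementation; in particular, the real `CustomOperatorParserExtension` handles string-literal preservation robustly, whereas this toy version only shows the shape of the transformation:

```java
import java.util.Set;
import java.util.regex.Pattern;

public class InfixRewriteDemo {
  // Illustrative only: rewrite `lhs OP 'literal'` to `OP(lhs, 'literal')`
  // before the standard parser runs, matching the operator name
  // case-insensitively.
  static String rewriteInfix(String sql, Set<String> operators) {
    String out = sql;
    for (String op : operators) {
      Pattern p = Pattern.compile(
          "(\\w+)\\s+" + Pattern.quote(op) + "\\s+('[^']*')",
          Pattern.CASE_INSENSITIVE);
      out = p.matcher(out).replaceAll(op + "($1, $2)");
    }
    return out;
  }

  public static void main(String[] args) {
    String sql = "SELECT * FROM t WHERE col indexquery 'param'";
    System.out.println(rewriteInfix(sql, Set.of("INDEXQUERY")));
    // SELECT * FROM t WHERE INDEXQUERY(col, 'param')
  }
}
```

After the rewrite, the function-call form flows through the Layer 2 machinery (`ResolveCustomPredicates` and `V2ExpressionBuilder`) like any other custom predicate.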
## How was this patch tested?
- **Unit tests** in `DataSourceV2StrategySuite`:
- 7 tests for `CustomOperatorParserExtension` (infix rewriting,
case-insensitivity, string literal preservation, multiple operators, etc.)
- Existing V2 predicate translation tests continue to pass
- **Integration tests** in `DataSourceV2Suite`:
- Custom predicate pushdown end-to-end (function resolved, translated,
pushed to scan)
- Custom predicate with type casting (argument auto-cast to declared
parameter types)
- Capability-gated predicate pushdown (LIKE pushed only when declared)
- **Regression suites** — all passing:
- `DataSourceV2Suite` (45 tests)
- `DataSourceV2StrategySuite` (28 tests)
- `DataSourceV2FunctionSuite` (44 tests)
- `V2PredicateSuite` (18 tests)
- `SparkSessionExtensionSuite` (31 tests)
- `AnalysisSuite` (82 tests)
- `JDBCV2Suite` (79 tests)
- **Style checks**: scalastyle (catalyst + core) and checkstyle (catalyst)
  both pass.
## Was this patch authored or co-authored using generative AI tooling?
Yes.
## New Files
| File | Module | Description |
|------|--------|-------------|
| `SupportsPushDownPredicateCapabilities.java` | catalyst | Interface on `ScanBuilder` declaring supported predicate names (Layer 1) |
| `SupportsCustomPredicates.java` | catalyst | Interface on `Table` declaring custom predicate descriptors (Layer 2) |
| `CustomPredicateDescriptor.java` | catalyst | Descriptor for a custom predicate function (Layer 2) |
| `CustomPredicateExpression.scala` | catalyst | Catalyst expression node for resolved custom predicates (Layer 2) |
| `ResolveCustomPredicates.scala` | catalyst | Analyzer rule resolving function calls against table descriptors (Layer 2) |
| `EnsureCustomPredicatesPushed.scala` | catalyst | Post-optimizer rule ensuring custom predicates are pushed (Layer 2) |
| `CustomOperatorParserExtension.scala` | catalyst | Abstract parser wrapper for infix operator rewriting (Layer 3) |
## Modified Files
| File | Description |
|------|-------------|
| `Analyzer.scala` | Added `ResolveCustomPredicates` to Resolution batch; modified `LookupFunctions` to skip custom predicate names |
| `V2ExpressionBuilder.scala` | Added capability-gated translation for LIKE/RLIKE/ILIKE/IS_NAN/ARRAY_CONTAINS/MAP_CONTAINS_KEY; added `CustomPredicateExpression` translation |
| `PushDownUtils.scala` | Query `SupportsPushDownPredicateCapabilities` and pass extra capabilities to translation |
| `DataSourceV2Strategy.scala` | Accept extra capabilities in `translateFilterV2WithMapping` |
| `V2ExpressionSQLBuilder.java` | Fixed StackOverflowError for unknown predicate names in `toString()` |
| `Predicate.java` | Updated Javadoc with new predicate names and custom predicate conventions |
| `SQLConf.scala` | Added `spark.sql.dataSource.extendedPredicatePushdown.enabled` config |
| `SparkOptimizer.scala` | Added `EnsureCustomPredicatesPushed` post-optimization rule |
| `DataSourceV2StrategySuite.scala` | Added parser extension unit tests |
| `DataSourceV2Suite.scala` | Added custom predicate and capability-gated predicate integration tests |
## Configuration
| Config Key | Default | Description |
|------------|---------|-------------|
| `spark.sql.dataSource.extendedPredicatePushdown.enabled` | `true` | Master switch for all extended predicate pushdown features (Layers 1-3) |
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]