schenksj opened a new pull request, #54669:
URL: https://github.com/apache/spark/pull/54669

   # [SPARK-55869][SQL] Extended Predicate Pushdown for DataSource V2
   
   ## What changes were proposed in this pull request?
   
   This PR extends Spark's DataSource V2 predicate pushdown framework with 
three layers of new functionality, all gated behind a single config switch 
(`spark.sql.dataSource.extendedPredicatePushdown.enabled`, default `true`).
   
   ### Layer 1: Capability-Gated Builtin Predicate Translation
   
   Data sources can now opt in to receiving additional builtin predicates by 
implementing `SupportsPushDownPredicateCapabilities` on their `ScanBuilder`. 
The interface declares a set of predicate names the source can handle:
   
   - `LIKE` — full pattern matching (`expr1 LIKE expr2`)
   - `RLIKE` — regex matching (`expr1 RLIKE expr2`)
   - `ILIKE` — case-insensitive LIKE (`expr1 ILIKE expr2`)
   - `IS_NAN` — NaN check (`isnan(expr)`)
   - `ARRAY_CONTAINS` — array element check (`array_contains(expr1, expr2)`)
   - `MAP_CONTAINS_KEY` — map key check (`map_contains_key(expr1, expr2)`)
   
   `V2ExpressionBuilder` consults the declared capabilities before translating 
these expressions, so sources that do not declare support retain the existing 
translation behavior.
   
   ### Layer 2: Custom Predicate Functions via `SupportsCustomPredicates`
   
   Tables can declare custom predicate functions by implementing 
`SupportsCustomPredicates`, which returns an array of 
`CustomPredicateDescriptor` objects. Each descriptor specifies:
   
   - `canonicalName()` — dot-qualified name (e.g. `com.mycompany.MY_SEARCH`) 
used in the V2 `Predicate`
   - `sqlName()` — the unqualified name users write in SQL (e.g. `my_search`)
   - `parameterTypes()` — optional expected parameter types (enables automatic 
casting)
   - `isDeterministic()` — whether the predicate is deterministic
   
   Users write standard SQL function-call syntax: `SELECT * FROM t WHERE 
my_search(col, 'param')`. The analyzer resolves these against the table's 
descriptors and produces `CustomPredicateExpression` nodes, which 
`V2ExpressionBuilder` translates into V2 `Predicate` objects using the 
dot-qualified canonical name.
   
   A post-optimizer rule (`EnsureCustomPredicatesPushed`) fails the query if 
any custom predicate remains unpushed, since Spark cannot evaluate them locally.
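   A descriptor and its resolution might look like the sketch below. This is 
illustrative only: the case class stands in for the `CustomPredicateDescriptor` 
Java interface described above, and `resolve` mimics what the analyzer rule is 
described to do; the case-insensitive SQL-name match is an assumption for 
illustration.
   
   ```scala
   object CustomPredicateSketch {
     // Hypothetical stand-in for the proposed descriptor interface.
     final case class CustomPredicateDescriptor(
         canonicalName: String,       // dot-qualified, used in the V2 Predicate
         sqlName: String,             // unqualified name users write in SQL
         parameterTypes: Seq[String], // declared types enabling automatic casts
         isDeterministic: Boolean)
   
     // Resolve a SQL function call against a table's declared descriptors.
     def resolve(
         descriptors: Seq[CustomPredicateDescriptor],
         sqlName: String): Option[CustomPredicateDescriptor] =
       descriptors.find(_.sqlName.equalsIgnoreCase(sqlName))
   }
   ```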
   
   ### Layer 3: Custom Infix Operator Syntax via Parser Extensions
   
   For data sources that want infix operator syntax (e.g. `col INDEXQUERY 
'param'`), an abstract `CustomOperatorParserExtension` base class is provided. 
It rewrites infix expressions to function calls before the standard parser runs:
   
   ```
   col INDEXQUERY 'param'  →  INDEXQUERY(col, 'param')
   ```
   
   Data source authors extend `CustomOperatorParserExtension`, implement 
`customOperators`, and register via `SparkSessionExtensions.injectParser`.
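   The rewrite itself can be sketched as a plain text transformation. This is a 
standalone illustration, not the PR's implementation: the real 
`CustomOperatorParserExtension` wraps the parser, while this helper only shows 
the `lhs OP 'literal'` to `OP(lhs, 'literal')` rewrite, assuming operator names 
are word-like (regex-safe).
   
   ```scala
   object InfixRewriteSketch {
     // Rewrites `lhs OP 'literal'` to `OP(lhs, 'literal')` for each registered
     // operator, matching the operator name case-insensitively and preserving
     // the string literal as written.
     def rewrite(sql: String, operators: Set[String]): String =
       operators.foldLeft(sql) { (text, op) =>
         val pattern = ("(?i)\\b(\\w+)\\s+" + op + "\\s+('(?:[^']|'')*')").r
         pattern.replaceAllIn(text, op.toUpperCase + "($1, $2)")
       }
   }
   ```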
   
   ## How was this patch tested?
   
   - **Unit tests** in `DataSourceV2StrategySuite`:
     - 7 tests for `CustomOperatorParserExtension` (infix rewriting, 
case-insensitivity, string literal preservation, multiple operators, etc.)
     - Existing V2 predicate translation tests continue to pass
   
   - **Integration tests** in `DataSourceV2Suite`:
     - Custom predicate pushdown end-to-end (function resolved, translated, 
pushed to scan)
     - Custom predicate with type casting (argument auto-cast to declared 
parameter types)
     - Capability-gated predicate pushdown (LIKE pushed only when declared)
   
   - **Regression suites** — all passing:
     - `DataSourceV2Suite` (45 tests)
     - `DataSourceV2StrategySuite` (28 tests)
     - `DataSourceV2FunctionSuite` (44 tests)
     - `V2PredicateSuite` (18 tests)
     - `SparkSessionExtensionSuite` (31 tests)
     - `AnalysisSuite` (82 tests)
     - `JDBCV2Suite` (79 tests)
   
   - **Style checks**: scalastyle (catalyst + core) and checkstyle (catalyst) 
all pass.
   
   ## Was this patch authored or co-authored using generative AI tooling?
   
   Yes.
   
   ## New Files
   
   | File | Module | Description |
   |------|--------|-------------|
   | `SupportsPushDownPredicateCapabilities.java` | catalyst | Interface on 
`ScanBuilder` declaring supported predicate names (Layer 1) |
   | `SupportsCustomPredicates.java` | catalyst | Interface on `Table` 
declaring custom predicate descriptors (Layer 2) |
   | `CustomPredicateDescriptor.java` | catalyst | Descriptor for a custom 
predicate function (Layer 2) |
   | `CustomPredicateExpression.scala` | catalyst | Catalyst expression node 
for resolved custom predicates (Layer 2) |
   | `ResolveCustomPredicates.scala` | catalyst | Analyzer rule resolving 
function calls against table descriptors (Layer 2) |
   | `EnsureCustomPredicatesPushed.scala` | catalyst | Post-optimizer rule 
ensuring custom predicates are pushed (Layer 2) |
   | `CustomOperatorParserExtension.scala` | catalyst | Abstract parser wrapper 
for infix operator rewriting (Layer 3) |
   
   ## Modified Files
   
   | File | Description |
   |------|-------------|
   | `Analyzer.scala` | Added `ResolveCustomPredicates` to Resolution batch; 
modified `LookupFunctions` to skip custom predicate names |
   | `V2ExpressionBuilder.scala` | Added capability-gated translation for 
LIKE/RLIKE/ILIKE/IS_NAN/ARRAY_CONTAINS/MAP_CONTAINS_KEY; added 
`CustomPredicateExpression` translation |
   | `PushDownUtils.scala` | Queries `SupportsPushDownPredicateCapabilities` and 
passes extra capabilities to translation |
   | `DataSourceV2Strategy.scala` | Accepts extra capabilities in 
`translateFilterV2WithMapping` |
   | `V2ExpressionSQLBuilder.java` | Fixed StackOverflowError for unknown 
predicate names in `toString()` |
   | `Predicate.java` | Updated Javadoc with new predicate names and custom 
predicate conventions |
   | `SQLConf.scala` | Added 
`spark.sql.dataSource.extendedPredicatePushdown.enabled` config |
   | `SparkOptimizer.scala` | Added `EnsureCustomPredicatesPushed` 
post-optimization rule |
   | `DataSourceV2StrategySuite.scala` | Added parser extension unit tests |
   | `DataSourceV2Suite.scala` | Added custom predicate and capability-gated 
predicate integration tests |
   
   ## Configuration
   
   | Config Key | Default | Description |
   |------------|---------|-------------|
   | `spark.sql.dataSource.extendedPredicatePushdown.enabled` | `true` | Master 
switch for all extended predicate pushdown features (Layers 1-3) |
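   As a usage sketch (not code from the PR), and assuming an active 
`SparkSession` named `spark`, the switch can be toggled per session:
   
   ```scala
   // Disable all extended predicate pushdown features (Layers 1-3).
   spark.conf.set("spark.sql.dataSource.extendedPredicatePushdown.enabled", "false")
   ```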
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
