[ 
https://issues.apache.org/jira/browse/SPARK-56660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084451#comment-18084451
 ] 

Anupam Yadav commented on SPARK-56660:
--------------------------------------

Hey [~eyck] , are you working on this? Would be happy to pick this up or 
collaborate with you on it, as you see fit.

Thanks!

> Optimizer: No file pruning for struct literal comparisons or tuple 
> comparisons, although equivalent scalar predicates would allow pruning
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-56660
>                 URL: https://issues.apache.org/jira/browse/SPARK-56660
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 4.1.1
>         Environment: Spark 4.1.1
> DeltaTable 4.2
>            Reporter: Eyck Troschke
>            Priority: Major
>
> When filtering a Delta table on individual fields of a nested struct, Spark 
> correctly performs file pruning based on per‑column statistics 
> (min/max/nullCount). However, when the same filter is expressed as a struct 
> literal comparison or a tuple comparison, Spark does not perform file 
> pruning, even though these predicates are semantically equivalent and could 
> be rewritten into conjunctive scalar predicates.
> This results in unnecessary full scans and significantly worse performance 
> for large Delta tables.
> h2. *Reproduction*
> Given a Delta table with a nested struct column:
> {{CREATE TABLE t (}}
> {{id STRING,}}
> {{nested STRUCT<}}
> {{col1: STRING,}}
> {{col2: STRING,}}
> {{col3: STRING}}
> {{>}}
> {{) USING DELTA;}}
> h3. *Case 1 — Scalar predicates (file pruning works)*
> {{SELECT *}}
> {{FROM t}}
> {{WHERE nested.col1 = 'foo'}}
> {{AND nested.col2 = 'bar'}}
> {{AND nested.col3 = 'baz';}}
> Spark correctly prunes files based on column‑level statistics.
> h3. *Case 2 — Struct literal comparison (no pruning)*
> {{SELECT *}}
> {{FROM t}}
> {{WHERE nested = struct('foo', 'bar', 'baz');}}
> This predicate is semantically equivalent to the conjunctive scalar 
> predicates above, but Spark does *not* perform file pruning. The entire 
> dataset is scanned.
> h3. *Case 3 — Tuple comparison (no pruning)*
> {{SELECT *}}
> {{FROM t}}
> {{WHERE (nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz');}}
> Tuple comparisons are also not decomposed into scalar predicates, even though 
> they are logically equivalent and safe to rewrite.
> h2. *Expected Behavior*
> Spark should be able to rewrite:
> {{nested = struct('foo', 'bar', 'baz')}}
> and:
> {{(nested.col1, nested.col2, nested.col3) = ('foo', 'bar', 'baz')}}
> into:
> {{nested.col1 = 'foo'}}
> {{AND nested.col2 = 'bar'}}
> {{AND nested.col3 = 'baz'}}
> at least when:
>  * field names and order match
>  * all fields are primitive types
>  * the struct or tuple literal is deterministic
> This rewrite would allow Delta Lake to use existing per‑column statistics for 
> file pruning.
> h2. *Actual Behavior*
>  * Struct equality and tuple equality are treated as opaque predicates.
>  * The optimizer does not decompose them into scalar comparisons.
>  * Delta Lake cannot use per‑column statistics.
>  * No file pruning occurs.
>  * Performance degrades significantly for large tables.
> h2. *Why This Matters*
>  * Structs and tuples are common modeling patterns in Spark SQL.
>  * Many APIs and UDFs return structs, making struct comparisons natural.
>  * Users expect semantically equivalent predicates to have equivalent 
> performance.
>  * Other SQL engines rewrite tuple/row comparisons into conjunctive 
> predicates.
>  * The rewrite is deterministic and preserves semantics for primitive fields.
> h2. *Proposed Improvement*
> Introduce an optimizer rule that:
>  # Detects struct equality or tuple equality where both sides are:
>  * 
>  ** deterministic expressions
>  ** structs/tuples of identical length
>  ** containing only primitive fields
>  # Rewrites the comparison into a conjunction of field‑level comparisons.
> This would enable file pruning without changing query semantics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to