Re: [PR] [SPARK-50593][SQL] SPJ: Support truncate transform via generalized ReducibleFunction API [spark]

via GitHub Fri, 19 Jun 2026 03:54:59 -0700


peter-toth commented on code in PR #55885:
URL: https://github.com/apache/spark/pull/55885#discussion_r3441967481



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TransformExpression.scala:
##########
@@ -92,24 +120,99 @@ case class TransformExpression(
    */
   def reducers(other: TransformExpression): Option[Reducer[_, _]] = {
     (function, other.function) match {
-      case(e1: ReducibleFunction[_, _], e2: ReducibleFunction[_, _]) =>
-        reducer(e1, numBucketsOpt, e2, other.numBucketsOpt)
+      case (e1: ReducibleFunction[_, _], e2: ReducibleFunction[_, _]) =>
+        reducer(e1, this, e2, other)
       case _ => None
     }
   }
 
-  // Return a Reducer for a reducible function on another reducible function
+  /**
+   * Extract all literal parameters of this transform as V2 [[V2Literal]]s, 
preserving each value's
+   * internal representation and its `DataType`. Connectors interpret the 
value via the accompanying
+   * `DataType` rather than relying on a pre-converted JVM type.
+   *
+   * Examples:
+   *   bucket(4, col)        => [Literal(4, IntegerType)]
+   *   truncate(col, 3)      => [Literal(3, IntegerType)]
+   *   days(col)             => []  (no literals)
+   */
+  private def extractParameters: Array[V2Literal[_]] =
+    literalChildren.map(l => LiteralValue(l.value, l.dataType): 
V2Literal[_]).toArray
+
+  /**
+   * Whether this transform and `other` share the same argument layout: equal 
arity, and at each
+   * position a literal slot aligns with a literal slot (and a non-literal 
with a non-literal).
+   * Literal *values* may differ -- that is what a [[Reducer]] reconciles.
+   */
+  private def sameArgumentLayout(other: TransformExpression): Boolean =

Review Comment:
   `sameArgumentLayout` and `isSameFunction` (L76-86) are now two separate 
`children.zip(...)` walkers. The literal axis differing is deliberate — 
equality for "same transform" vs. "values may differ, a `Reducer` reconciles 
them". But the non-literal handling drifted apart, and `sameArgumentLayout` 
ended up strictly weaker:
   
   | position pair | `isSameFunction` | `sameArgumentLayout` |
   |---|---|---|
   | (literal, literal) | `l1 == l2` | `true` — intended |
   | (transform, transform) | recursive `isSameFunction` | `case _ => true` |
   | (col, transform) | `isColumnRef && isColumnRef` | `case _ => true` |
   
   So `sameArgumentLayout` treats any two non-literals as interchangeable, and 
its correctness rests entirely on the `supportsExpressions` gate guaranteeing 
the single non-literal child is always a plain column ref. On its own it would 
let two structurally different transforms that share a function name and 
literal layout reduce (e.g. `bucket(4, days(c))` vs `bucket(4, hours(c))`).
   
   That's fine as the code stands. But since these were a single parameterized 
helper a commit ago (`childrenMatch`), it'd be nice to bring 
`sameArgumentLayout` back in sync with `isSameFunction` so the precondition is 
correct on its own rather than gate-dependent (future-proof):
   
   ```scala
   /** Per-position match; `literalsMatch` is the only axis that varies between 
callers. */
   private def childrenMatch(other: TransformExpression)
       (literalsMatch: (Literal, Literal) => Boolean): Boolean =
     children.length == other.children.length &&
       children.zip(other.children).forall {
         case (l1: Literal, l2: Literal) => literalsMatch(l1, l2)
         case (t1: TransformExpression, t2: TransformExpression) => 
t1.isSameFunction(t2)
         case (c1, c2) => TransformExpression.isColumnRef(c1) && 
TransformExpression.isColumnRef(c2)
       }
   
   def isSameFunction(other: TransformExpression): Boolean =
     function.canonicalName() == other.function.canonicalName() &&
       childrenMatch(other)(_ == _)
   
   // reducer precondition: same layout/structure, literal *values* may differ
   private def sameArgumentLayout(other: TransformExpression): Boolean =
     childrenMatch(other)((_, _) => true)
   ```
   
   Equally fine to leave as-is if you'd rather not touch it now — mainly 
flagging so the gate-dependency is a conscious choice.
   



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ReducibleFunction.java:
##########
@@ -60,6 +61,37 @@
 @Evolving
 public interface ReducibleFunction<I, O> {
 
+  /**
+   * Generic reducer for parameterized functions (bucket, truncate, etc.).
+   *
+   * If this function is 'reducible' on another function, return the {@link 
Reducer}.
+   * <p>
+   * Each parameter is a scalar {@link Literal} carrying both its value and 
data type. Parameters
+   * are always scalar literals (e.g. bucket numBuckets, truncate width); 
complex types (array, map,
+   * struct) are not passed here. {@link Literal#value()} is Spark's internal 
representation (e.g.
+   * {@code UTF8String} for strings, {@code Decimal} for decimals); use {@link 
Literal#dataType()}
+   * to interpret it.
+   * <p>
+   * Examples:
+   * <ul>
+   *     <li>bucket(4, x) and bucket(2, x): thisParams = [4], otherParams = 
[2]</li>
+   *     <li>truncate(x, 3) and truncate(x, 5): thisParams = [3], otherParams 
= [5]</li>
+   *     <li>hypothetical range_bucket(x, 0L, 100L, 4): thisParams = [0L, 
100L, 4]</li>
+   * </ul>
+   *
+   * @param thisParams literal parameters for this function
+   * @param otherFunction the other parameterized function
+   * @param otherParams literal parameters for the other function
+   * @return a reduction function if reducible, null otherwise
+   * @since 5.0.0

Review Comment:
   Version stamp: this should be `4.3.0`, not `5.0.0`. The companion 
SPARK-57038 merged to both `master` (5.0.0) and `branch-4.x` (4.3.0); assuming 
this PR is likewise backported, it first ships in 4.3.0, so per the "version of 
the branch it first ships in" rule both this `@since` and the 
`@Deprecated(since = "5.0.0")` on the deprecated overload below (L118) should 
read `4.3.0`. If this is intended master-only, ignore.
   
   ```suggestion
      * @since 4.3.0
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-50593][SQL] SPJ: Support truncate transform via generalized ReducibleFunction API [spark]

Reply via email to