peter-toth commented on code in PR #55885:
URL: https://github.com/apache/spark/pull/55885#discussion_r3441967481
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/TransformExpression.scala:
##########
@@ -92,24 +120,99 @@ case class TransformExpression(
*/
def reducers(other: TransformExpression): Option[Reducer[_, _]] = {
(function, other.function) match {
- case(e1: ReducibleFunction[_, _], e2: ReducibleFunction[_, _]) =>
- reducer(e1, numBucketsOpt, e2, other.numBucketsOpt)
+ case (e1: ReducibleFunction[_, _], e2: ReducibleFunction[_, _]) =>
+ reducer(e1, this, e2, other)
case _ => None
}
}
- // Return a Reducer for a reducible function on another reducible function
+ /**
+ * Extract all literal parameters of this transform as V2 [[V2Literal]]s,
preserving each value's
+ * internal representation and its `DataType`. Connectors interpret the
value via the accompanying
+ * `DataType` rather than relying on a pre-converted JVM type.
+ *
+ * Examples:
+ * bucket(4, col) => [Literal(4, IntegerType)]
+ * truncate(col, 3) => [Literal(3, IntegerType)]
+ * days(col) => [] (no literals)
+ */
+ private def extractParameters: Array[V2Literal[_]] =
+ literalChildren.map(l => LiteralValue(l.value, l.dataType):
V2Literal[_]).toArray
+
+ /**
+ * Whether this transform and `other` share the same argument layout: equal
arity, and at each
+ * position a literal slot aligns with a literal slot (and a non-literal
with a non-literal).
+ * Literal *values* may differ -- that is what a [[Reducer]] reconciles.
+ */
+ private def sameArgumentLayout(other: TransformExpression): Boolean =
Review Comment:
`sameArgumentLayout` and `isSameFunction` (L76-86) are now two separate
`children.zip(...)` walkers. The literal axis differing is deliberate —
equality for "same transform" vs. "values may differ, a `Reducer` reconciles
them". But the non-literal handling drifted apart, and `sameArgumentLayout`
ended up strictly weaker:
| position pair | `isSameFunction` | `sameArgumentLayout` |
|---|---|---|
| (literal, literal) | `l1 == l2` | `true` — intended |
| (transform, transform) | recursive `isSameFunction` | `case _ => true` |
| (col, transform) | `isColumnRef && isColumnRef` | `case _ => true` |
So `sameArgumentLayout` treats any two non-literals as interchangeable, and
its correctness rests entirely on the `supportsExpressions` gate guaranteeing
the single non-literal child is always a plain column ref. On its own it would
let two structurally different transforms that share a function name and
literal layout reduce (e.g. `bucket(4, days(c))` vs `bucket(4, hours(c))`).
That's fine as the code stands. But since these were a single parameterized
helper a commit ago (`childrenMatch`), it'd be nice to bring
`sameArgumentLayout` back in sync with `isSameFunction` so the precondition is
correct on its own rather than gate-dependent (future-proof):
```scala
/** Per-position match; `literalsMatch` is the only axis that varies between
callers. */
private def childrenMatch(other: TransformExpression)
(literalsMatch: (Literal, Literal) => Boolean): Boolean =
children.length == other.children.length &&
children.zip(other.children).forall {
case (l1: Literal, l2: Literal) => literalsMatch(l1, l2)
case (t1: TransformExpression, t2: TransformExpression) =>
t1.isSameFunction(t2)
case (c1, c2) => TransformExpression.isColumnRef(c1) &&
TransformExpression.isColumnRef(c2)
}
def isSameFunction(other: TransformExpression): Boolean =
function.canonicalName() == other.function.canonicalName() &&
childrenMatch(other)(_ == _)
// reducer precondition: same layout/structure, literal *values* may differ
private def sameArgumentLayout(other: TransformExpression): Boolean =
childrenMatch(other)((_, _) => true)
```
Equally fine to leave as-is if you'd rather not touch it now — mainly
flagging so the gate-dependency is a conscious choice.
##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/ReducibleFunction.java:
##########
@@ -60,6 +61,37 @@
@Evolving
public interface ReducibleFunction<I, O> {
+ /**
+ * Generic reducer for parameterized functions (bucket, truncate, etc.).
+ *
+ * If this function is 'reducible' on another function, return the {@link
Reducer}.
+ * <p>
+ * Each parameter is a scalar {@link Literal} carrying both its value and
data type. Parameters
+ * are always scalar literals (e.g. bucket numBuckets, truncate width);
complex types (array, map,
+ * struct) are not passed here. {@link Literal#value()} is Spark's internal
representation (e.g.
+ * {@code UTF8String} for strings, {@code Decimal} for decimals); use {@link
Literal#dataType()}
+ * to interpret it.
+ * <p>
+ * Examples:
+ * <ul>
+ * <li>bucket(4, x) and bucket(2, x): thisParams = [4], otherParams =
[2]</li>
+ * <li>truncate(x, 3) and truncate(x, 5): thisParams = [3], otherParams
= [5]</li>
+ * <li>hypothetical range_bucket(x, 0L, 100L, 4): thisParams = [0L,
100L, 4]</li>
+ * </ul>
+ *
+ * @param thisParams literal parameters for this function
+ * @param otherFunction the other parameterized function
+ * @param otherParams literal parameters for the other function
+ * @return a reduction function if reducible, null otherwise
+ * @since 5.0.0
Review Comment:
Version stamp: this should be `4.3.0`, not `5.0.0`. The companion
SPARK-57038 merged to both `master` (5.0.0) and `branch-4.x` (4.3.0); assuming
this PR is likewise backported, it first ships in 4.3.0, so per the "version of
the branch it first ships in" rule both this `@since` and the
`@Deprecated(since = "5.0.0")` on the deprecated overload below (L118) should
read `4.3.0`. If this is intended master-only, ignore.
```suggestion
* @since 4.3.0
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]