cloud-fan commented on code in PR #55599:
URL: https://github.com/apache/spark/pull/55599#discussion_r3160524432
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/variant/variantExpressions.scala:
##########
@@ -952,3 +952,40 @@ case class SchemaOfVariantAgg(
override protected def withNewChildInternal(newChild: Expression):
Expression =
copy(child = newChild)
}
+
+@ExpressionDescription(
+ usage = "_FUNC_(v) - Returns true if the variant is valid, false if it is
malformed.",
+ examples = """
+ Examples:
+ > SELECT _FUNC_(parse_json('null'));
+ true
+ > SELECT _FUNC_(parse_json('[{"b":true,"a":0}]'));
+ true
+ """,
+ since = "4.2.0",
+ group = "variant_funcs"
+)
+case class IsValidVariant(child: Expression) extends UnaryExpression with
ExpectsInputTypes {
Review Comment:
Consider aligning this with the peer pattern used by `IsVariantNull` (line
101), `SchemaOfVariant`, and `ParseJson`: extend `Predicate` (which provides
`dataType = BooleanType`, making the override below redundant) and use
`RuntimeReplaceable` + `StaticInvoke` instead of `nullSafeEval` + a hand-rolled
`doGenCode`. Adding `def isValidVariant(input: VariantVal): Boolean =
VariantUtil.isValidVariant(input.getValue, input.getMetadata)` to
`VariantExpressionEvalUtils` and calling it through `StaticInvoke` would let
you drop both `nullSafeEval` and `doGenCode` here. There's nothing
instance-specific (no `addReferenceObj`-style state, unlike `ToVariantObject`)
that prevents the cleaner pattern.
##########
common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java:
##########
@@ -589,6 +590,80 @@ public static <T> T handleArray(byte[] value, int pos,
ArrayHandler<T> handler)
return handler.apply(size, offsetSize, offsetStart, dataStart);
}
+ // Validate whether a variant is well-formed. Returns true if the variant is
valid, false if it
+ // is malformed.
+ // Specifically, it returns true if and only if `toJson` doesn't throw an
exception. The
+ // implementation also has the same structure as `toJson` (see
`Variant.toJsonImpl`).
Review Comment:
Two refinements worth noting in this comment:
1. The "if and only if `toJson` doesn't throw" claim isn't strict.
`Variant.toJson` first goes through the `Variant` constructor, which throws
`VARIANT_CONSTRUCTOR_SIZE_LIMIT` for value/metadata over `SIZE_LIMIT` —
`isValidVariant` skips that check. A TIMESTAMP/TIMESTAMP_NTZ value out of
`Instant`'s range also passes `validateImpl` (which only calls `getLong`) but
trips `DateTimeException`/`ArithmeticException` from `Instant.EPOCH.plus(...)`
inside `toJsonImpl`. Consider softening to something like "returns true if the
variant binary is structurally well-formed (all bounds and type-info checks
pass)," or invoking `new Variant(...).toJson(...)` directly to honor the
literal contract.
2. The `try { ... } catch (SparkRuntimeException e)` is sound only because
every helper invoked inside `validateImpl` throws `MALFORMED_VARIANT` /
`UNKNOWN_PRIMITIVE_TYPE_IN_VARIANT` rather than raw
`ArrayIndexOutOfBoundsException` — and this PR's `getDecimalWithOriginalScale`
patch is exactly closing one such leak. A short note on `validateImpl` ("relies
on every helper throwing `SparkRuntimeException` rather than AIOOBE on
malformed input") would help future maintainers preserve that invariant when
adding cases.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]