sunchao commented on code in PR #56334:
URL: https://github.com/apache/spark/pull/56334#discussion_r3503142690
##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
// todo: support more types.
+ /**
+ * Check if a Spark DataType is supported by Arrow. This recursively checks
complex types
+ * (Array, Struct, Map).
+ *
+ * Note: This checks compatibility with toArrowField(), not toArrowType().
Types like
+ * GeometryType, GeographyType, and VariantType are not supported by
toArrowType() (which only
+ * handles primitive Arrow types), but ARE supported by toArrowField() which
converts them to
+ * Arrow Struct representations with metadata. Since Arrow cache uses
toArrowField() via
+ * toArrowSchema() to create the schema, these types are supported.
+ */
+ def isSupportedByArrow(dt: DataType): Boolean = {
+ dt match {
+ // Primitive types
+ case BooleanType | ByteType | ShortType | IntegerType | LongType |
FloatType | DoubleType |
+ _: StringType | BinaryType | NullType =>
+ true
+
+ // Decimal
+ case _: DecimalType => true
+
+ // Temporal types
+ case DateType | TimestampType | TimestampNTZType | _: TimeType => true
Review Comment:
[P2] The prerequisite cited here has now landed in this exact tree:
`DefaultCachedBatchSerializer` has nanosecond timestamp builders in
`ColumnBuilder.scala`, matching `ColumnType` cases and
`TimestampNanosColumnStats`. This predicate still rejects both
`TimestampNTZNanosType` and `TimestampLTZNanosType`, so the Arrow cache is now
less capable than the default cache even though the guide says it covers every
default-supported type. Since #56842 was the stated gate, please reconcile this
before merging by completing Arrow-cache support or narrowing the compatibility
claim.
_\[ :robot: posted by Codex on behalf of sunchao using the
code-review-for-me skill :robot: \]_
##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
// todo: support more types.
+ /**
+ * Check if a Spark DataType is supported by Arrow. This recursively checks
complex types
+ * (Array, Struct, Map).
+ *
+ * Note: This checks compatibility with toArrowField(), not toArrowType().
Types like
+ * GeometryType, GeographyType, and VariantType are not supported by
toArrowType() (which only
+ * handles primitive Arrow types), but ARE supported by toArrowField() which
converts them to
+ * Arrow Struct representations with metadata. Since Arrow cache uses
toArrowField() via
+ * toArrowSchema() to create the schema, these types are supported.
+ */
+ def isSupportedByArrow(dt: DataType): Boolean = {
+ dt match {
+ // Primitive types
+ case BooleanType | ByteType | ShortType | IntegerType | LongType |
FloatType | DoubleType |
+ _: StringType | BinaryType | NullType =>
+ true
+
+ // Decimal
+ case _: DecimalType => true
+
+ // Temporal types
+ case DateType | TimestampType | TimestampNTZType | _: TimeType => true
+
+ // Interval types
+ case _: YearMonthIntervalType | _: DayTimeIntervalType |
CalendarIntervalType => true
Review Comment:
[P2] The diagnostic fix still covers only top-level intervals.
`hasCalendarInterval` checks `attr.dataType == CalendarIntervalType`, while
`isSupportedByArrow` accepts CalendarInterval recursively inside arrays,
structs, maps, and UDTs, and their recursive Arrow writers still reach
`Math.multiplyExact(microseconds, 1000L)`. For example, an `array<interval>`
containing `Long.MaxValue / 1000 + 1` still escapes
`withIntervalOverflowTranslation` and fails with raw `ArithmeticException: long
overflow`, contrary to the guide and this reply. Please recurse through complex
types and `UserDefinedType.sqlType`, with nested overflow coverage.
_\[ :robot: posted by Codex on behalf of sunchao using the
code-review-for-me skill :robot: \]_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]