Re: [PR] [SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching [spark]

via GitHub Tue, 30 Jun 2026 20:42:22 -0700


sunchao commented on code in PR #56334:
URL: https://github.com/apache/spark/pull/56334#discussion_r3503142690



##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
 
   // todo: support more types.
 
+  /**
+   * Check if a Spark DataType is supported by Arrow. This recursively checks 
complex types
+   * (Array, Struct, Map).
+   *
+   * Note: This checks compatibility with toArrowField(), not toArrowType(). 
Types like
+   * GeometryType, GeographyType, and VariantType are not supported by 
toArrowType() (which only
+   * handles primitive Arrow types), but ARE supported by toArrowField() which 
converts them to
+   * Arrow Struct representations with metadata. Since Arrow cache uses 
toArrowField() via
+   * toArrowSchema() to create the schema, these types are supported.
+   */
+  def isSupportedByArrow(dt: DataType): Boolean = {
+    dt match {
+      // Primitive types
+      case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+          _: StringType | BinaryType | NullType =>
+        true
+
+      // Decimal
+      case _: DecimalType => true
+
+      // Temporal types
+      case DateType | TimestampType | TimestampNTZType | _: TimeType => true

Review Comment:
   [P2] The prerequisite cited here has now landed in this exact tree: 
`DefaultCachedBatchSerializer` has nanosecond timestamp builders in 
`ColumnBuilder.scala`, matching `ColumnType` cases and 
`TimestampNanosColumnStats`. This predicate still rejects both 
`TimestampNTZNanosType` and `TimestampLTZNanosType`, so the Arrow cache is now 
less capable than the default cache even though the guide says it covers every 
default-supported type. Since #56842 was the stated gate, please reconcile this 
before merging by completing Arrow-cache support or narrowing the compatibility 
claim.
   
   _\[ :robot: posted by Codex on behalf of sunchao using the 
code-review-for-me skill :robot: \]_



##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
 
   // todo: support more types.
 
+  /**
+   * Check if a Spark DataType is supported by Arrow. This recursively checks 
complex types
+   * (Array, Struct, Map).
+   *
+   * Note: This checks compatibility with toArrowField(), not toArrowType(). 
Types like
+   * GeometryType, GeographyType, and VariantType are not supported by 
toArrowType() (which only
+   * handles primitive Arrow types), but ARE supported by toArrowField() which 
converts them to
+   * Arrow Struct representations with metadata. Since Arrow cache uses 
toArrowField() via
+   * toArrowSchema() to create the schema, these types are supported.
+   */
+  def isSupportedByArrow(dt: DataType): Boolean = {
+    dt match {
+      // Primitive types
+      case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+          _: StringType | BinaryType | NullType =>
+        true
+
+      // Decimal
+      case _: DecimalType => true
+
+      // Temporal types
+      case DateType | TimestampType | TimestampNTZType | _: TimeType => true
+
+      // Interval types
+      case _: YearMonthIntervalType | _: DayTimeIntervalType | 
CalendarIntervalType => true

Review Comment:
   [P2] The diagnostic fix still covers only top-level intervals. 
`hasCalendarInterval` checks `attr.dataType == CalendarIntervalType`, while 
`isSupportedByArrow` accepts CalendarInterval recursively inside arrays, 
structs, maps, and UDTs, and their recursive Arrow writers still reach 
`Math.multiplyExact(microseconds, 1000L)`. For example, an `array<interval>` 
containing `Long.MaxValue / 1000 + 1` still escapes 
`withIntervalOverflowTranslation` and fails with raw `ArithmeticException: long 
overflow`, contrary to the guide and this reply. Please recurse through complex 
types and `UserDefinedType.sqlType`, with nested overflow coverage.
   
   _\[ :robot: posted by Codex on behalf of sunchao using the 
code-review-for-me skill :robot: \]_



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching [spark]

Reply via email to