Re: [PR] [SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching [spark]

via GitHub Sat, 27 Jun 2026 23:37:55 -0700


viirya commented on code in PR #56334:
URL: https://github.com/apache/spark/pull/56334#discussion_r3487468545



##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
 
   // todo: support more types.
 
+  /**
+   * Check if a Spark DataType is supported by Arrow. This recursively checks 
complex types
+   * (Array, Struct, Map).
+   *
+   * Note: This checks compatibility with toArrowField(), not toArrowType(). 
Types like
+   * GeometryType, GeographyType, and VariantType are not supported by 
toArrowType() (which only
+   * handles primitive Arrow types), but ARE supported by toArrowField() which 
converts them to
+   * Arrow Struct representations with metadata. Since Arrow cache uses 
toArrowField() via
+   * toArrowSchema() to create the schema, these types are supported.
+   */
+  def isSupportedByArrow(dt: DataType): Boolean = {
+    dt match {
+      // Primitive types
+      case BooleanType | ByteType | ShortType | IntegerType | LongType | 
FloatType | DoubleType |
+          _: StringType | BinaryType | NullType =>
+        true
+
+      // Decimal
+      case _: DecimalType => true
+
+      // Temporal types
+      case DateType | TimestampType | TimestampNTZType | _: TimeType => true

Review Comment:
   Follow-up filed: SPARK-57735 / https://github.com/apache/spark/pull/56842 
adds nanosecond-timestamp support to the default in-memory cache 
(`DefaultCachedBatchSerializer`), which is the prerequisite -- once that lands, 
the Arrow cache can route these types through the same statistics machinery 
rather than rejecting them here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching [spark]

Reply via email to