Re: [PR] [SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching [spark]

via GitHub Wed, 24 Jun 2026 16:43:24 -0700


viirya commented on code in PR #56334:
URL: https://github.com/apache/spark/pull/56334#discussion_r3470957437



##########
docs/sql-arrow-cache-format.md:
##########
@@ -0,0 +1,312 @@
+# Apache Arrow Cache Format for Spark
+
+## Overview
+
+Apache Spark supports using Apache Arrow as an alternative cache format for 
in-memory Dataset caching. This format provides improved performance for 
certain workloads, especially when working with columnar data sources like 
Parquet and ORC.
+
+## Benefits
+
+The Arrow cache format offers several advantages over the default cache format:
+
+- **Zero-copy reads** when input is already in Arrow format (e.g., Arrow-based 
data sources, re-caching Arrow cached data)
+- **Better filter pushdown** with min/max statistics for partition pruning
+- **Off-heap memory management** via Arrow allocators
+- **Efficient compression** with zstd and lz4 codecs
+- **Arrow ecosystem interoperability** for data sharing
+
+**Note**: Spark's built-in Parquet/ORC readers use internal column vectors 
(`OnHeapColumnVector`/`OffHeapColumnVector`), not Arrow format, so they don't 
benefit from zero-copy optimization.
+
+## Configuration
+
+To enable Arrow cache format, set the static configuration:
+
+```scala
+spark.conf.set("spark.sql.cache.serializer",
+  "org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer")

Review Comment:
   The cache serializer is selected once by the static 
`spark.sql.cache.serializer` conf, and once set it handles every cached 
relation -- Spark's cache framework has no per-relation fallback to a different 
serializer based on data type. So there's no automatic switch back to the 
default cache driver. The Arrow serializer aims to cover everything the default 
serializer does (and more -- e.g. geometry/geography, which the default has no 
encoding for). If a column uses a type Arrow genuinely cannot represent (e.g. 
`ObjectType`), materializing the cache throws `UNSUPPORTED_DATATYPE` rather 
than silently falling back. I've clarified both points in the docs.



##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
 
   // todo: support more types.
 
+  /**
+   * Check if a Spark DataType is supported by Arrow. This recursively checks 
complex types
+   * (Array, Struct, Map).
+   *
+   * Note: This checks compatibility with toArrowField(), not toArrowType(). 
Types like
+   * GeometryType, GeographyType, and VariantType are not supported by 
toArrowType() (which only
+   * handles primitive Arrow types), but ARE supported by toArrowField() which 
converts them to
+   * Arrow Struct representations with metadata. Since Arrow cache uses 
toArrowField() via
+   * toArrowSchema() to create the schema, these types are supported.
+   */
+  def isSupportedByArrow(dt: DataType): Boolean = {

Review Comment:
   Good catch -- "supports all Spark SQL data types" did overstate it, and the 
type list was also incomplete (it omitted Time, intervals, Geometry/Geography, 
Variant, Null, and UDTs, which are all supported). Fixed the doc to list the 
actually-supported set and to describe the unsupported-type behavior.
   
   One clarification on the mechanism, re: your fallback question: 
`isSupportedByArrow` here only gates `supportsColumnarInput`, which the cache 
framework uses to choose the columnar-vs-row *input* path into this same 
serializer -- it isn't a fallback to the default cache driver (there's no such 
per-type fallback). A truly unsupported type isn't silently dropped either: 
`toArrowSchema` throws `UNSUPPORTED_DATATYPE` when the cache is materialized. 
The docs now state this explicitly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching [spark]

Reply via email to