Re: [PR] [SPARK-57268][SQL] Add Apache Arrow as a native cache format for in-memory Dataset caching [spark]

via GitHub Sat, 27 Jun 2026 20:14:49 -0700


viirya commented on code in PR #56334:
URL: https://github.com/apache/spark/pull/56334#discussion_r3487276946



##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
 
   // todo: support more types.
 
+  /**
+   * Check if a Spark DataType is supported by Arrow. This recursively checks complex types
+   * (Array, Struct, Map).
+   *
+   * Note: This checks compatibility with toArrowField(), not toArrowType(). Types like
+   * GeometryType, GeographyType, and VariantType are not supported by toArrowType() (which only
+   * handles primitive Arrow types), but ARE supported by toArrowField() which converts them to
+   * Arrow Struct representations with metadata. Since Arrow cache uses toArrowField() via
+   * toArrowSchema() to create the schema, these types are supported.
+   */
+  def isSupportedByArrow(dt: DataType): Boolean = {
+    dt match {
+      // Primitive types
+      case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
+          _: StringType | BinaryType | NullType =>
+        true
+
+      // Decimal
+      case _: DecimalType => true
+
+      // Temporal types
+      case DateType | TimestampType | TimestampNTZType | _: TimeType => true
+
+      // Interval types
+      case _: YearMonthIntervalType | _: DayTimeIntervalType | CalendarIntervalType => true

Review Comment:
   Fixed with a clear diagnostic. Caching a CalendarInterval whose microseconds 
exceed +/-(Long.MaxValue / 1000) now throws an explanatory error (naming the 
type and the nanosecond-conversion limit) instead of an opaque 
`ArithmeticException: long overflow`. The check is installed only when the 
schema actually contains a CalendarInterval column, so there is no per-row cost 
for other schemas. Arrow's `IntervalMonthDayNano` is nanosecond-based and 
cannot losslessly hold the full Long microsecond domain, so I documented the 
value-range limit rather than changing the shared Arrow writer; the default 
serializer's lack of this restriction is noted in the docs.
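
   For illustration, the range limit can be sketched in plain Scala without Spark. The names `IntervalRangeCheck`, `microsToNanosExact`, and `microsToNanosChecked`, as well as the exception type, are hypothetical and are not the identifiers used in this PR; only the `Long.MaxValue / 1000` bound and the `long overflow` failure mode come from the behavior described above.

```scala
object IntervalRangeCheck {
  // Largest microsecond magnitude whose nanosecond value still fits in a Long.
  val MaxMicros: Long = Long.MaxValue / 1000L

  // Unchecked conversion: Math.multiplyExact throws an ArithmeticException
  // ("long overflow") when the product does not fit in a Long.
  def microsToNanosExact(micros: Long): Long = Math.multiplyExact(micros, 1000L)

  // Hypothetical checked conversion: validates the range first so the caller
  // gets a message naming the type and the limit instead of a bare overflow.
  def microsToNanosChecked(micros: Long): Long = {
    if (micros > MaxMicros || micros < -MaxMicros) {
      throw new IllegalArgumentException(
        s"CalendarInterval microseconds ($micros) exceed +/-$MaxMicros; " +
          "the value cannot be converted to Arrow nanoseconds")
    }
    micros * 1000L
  }
}
```

   Checking the range up front, rather than catching the overflow afterwards, keeps the error message independent of how the shared writer performs the multiplication.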



##########
docs/sql-arrow-cache-format.md:
##########
@@ -0,0 +1,343 @@
+# Apache Arrow Cache Format for Spark

Review Comment:
   Fixed. Added the standard Jekyll front matter (`layout`, `title`, 
`displayTitle`, and the ASF license block) so the page produces 
`sql-arrow-cache-format.html`.



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ArrowCachedBatchSerializer.scala:
##########
@@ -0,0 +1,1459 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.columnar
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
+import java.nio.channels.Channels
+
+import scala.jdk.CollectionConverters._
+
+import org.apache.arrow.compression.{Lz4CompressionCodec, ZstdCompressionCodec}
+import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot, VectorUnloader}
+import org.apache.arrow.vector.compression.{CompressionCodec, NoCompressionCodec}
+import org.apache.arrow.vector.ipc.{ReadChannel, WriteChannel}
+import org.apache.arrow.vector.ipc.message.{ArrowRecordBatch, MessageSerializer}
+
+import org.apache.spark.{SparkException, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Attribute, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
+import org.apache.spark.sql.catalyst.types.DataTypeUtils
+import org.apache.spark.sql.columnar.{CachedBatch, SimpleMetricsCachedBatchSerializer}
+import org.apache.spark.sql.execution.arrow.ArrowWriter
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.util.ArrowUtils
+import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.util.Utils
+
+/**
+ * A [[CachedBatchSerializer]] that uses Apache Arrow as the cache format.
+ *
+ * This serializer:
+ *  - Supports both row-based (InternalRow) and columnar (ColumnarBatch) input
+ *  - Stores data in Arrow IPC streaming format with optional compression (zstd/lz4)
+ *  - Enables zero-copy columnar reads when output is ColumnarBatch
+ *  - Uses off-heap memory via Arrow allocators
+ *  - Collects per-column statistics for partition pruning
+ *  - Provides efficient interoperability with Arrow ecosystem
+ *
+ * Configuration options:
+ *  - spark.sql.cache.serializer: Set to this class name to enable
+ *  - spark.sql.execution.arrow.maxRecordsPerBatch: Max rows per cached batch
+ *  - spark.sql.execution.arrow.compression.codec: Compression (none/zstd/lz4)
+ *  - spark.sql.inMemoryColumnarStorage.enableVectorizedReader: Enable columnar output
+ */
+class ArrowCachedBatchSerializer extends SimpleMetricsCachedBatchSerializer {
+
+  override def supportsColumnarInput(schema: Seq[Attribute]): Boolean = {
+    // Check if all data types in the schema are supported by Arrow
+    schema.forall(attr => ArrowUtils.isSupportedByArrow(attr.dataType))
+  }
+
+  override def convertInternalRowToCachedBatch(
+      input: RDD[InternalRow],
+      schema: Seq[Attribute],
+      storageLevel: StorageLevel,
+      conf: SQLConf): RDD[CachedBatch] = {
+    // Capture config values on driver before RDD transformation
+    val sparkSchema = DataTypeUtils.fromAttributes(schema)
+    val maxRecordsPerBatch = conf.arrowMaxRecordsPerBatch
+    val maxBytesPerBatch = conf.arrowMaxBytesPerBatch
+    val timeZoneId = conf.sessionLocalTimeZone
+    val compressionCodecName = conf.arrowCompressionCodec
+    val compressionLevel = conf.arrowZstdCompressionLevel
+
+    input.mapPartitionsInternal { rowIterator =>
+      new InternalRowToArrowCachedBatchIterator(
+        rowIterator,
+        schema,
+        sparkSchema,
+        maxRecordsPerBatch,
+        maxBytesPerBatch,
+        timeZoneId,
+        compressionCodecName,
+        compressionLevel)
+    }
+  }
+
+  override def convertColumnarBatchToCachedBatch(
+      input: RDD[ColumnarBatch],
+      schema: Seq[Attribute],
+      storageLevel: StorageLevel,
+      conf: SQLConf): RDD[CachedBatch] = {
+    // Capture config values on driver before RDD transformation
+    val sparkSchema = DataTypeUtils.fromAttributes(schema)
+    val timeZoneId = conf.sessionLocalTimeZone
+    val compressionCodecName = conf.arrowCompressionCodec
+    val compressionLevel = conf.arrowZstdCompressionLevel
+
+    input.mapPartitionsInternal { batchIterator =>
+      new ColumnarBatchToArrowCachedBatchIterator(
+        batchIterator,
+        schema,
+        sparkSchema,
+        timeZoneId,
+        compressionCodecName,
+        compressionLevel)
+    }
+  }
+
+  override def supportsColumnarOutput(schema: StructType): Boolean = {
+    // Always support columnar output with Arrow
+    true
+  }
+
+  override def vectorTypes(attributes: Seq[Attribute], conf: SQLConf): Option[Seq[String]] = {
+    Option(Seq.fill(attributes.length)(classOf[ArrowColumnVector].getName))
+  }
+
+  override def convertCachedBatchToColumnarBatch(
+      input: RDD[CachedBatch],
+      cacheAttributes: Seq[Attribute],
+      selectedAttributes: Seq[Attribute],
+      conf: SQLConf): RDD[ColumnarBatch] = {
+    val cacheSchema = DataTypeUtils.fromAttributes(cacheAttributes)
+    val selectedSchema = DataTypeUtils.fromAttributes(selectedAttributes)
+    val columnIndices =
+      selectedAttributes.map(a => cacheAttributes.map(o => o.exprId).indexOf(a.exprId)).toArray
+    // Capture config on driver
+    val timeZoneId = conf.sessionLocalTimeZone
+    val prefetchEnabled = conf.arrowCachePrefetchEnabled
+
+    input.mapPartitionsInternal { batchIterator =>
+      new ArrowCachedBatchToColumnarBatchIterator(
+        batchIterator,
+        cacheSchema,
+        selectedSchema,
+        columnIndices,
+        timeZoneId,
+        prefetchEnabled)
+    }
+  }
+
+  override def convertCachedBatchToInternalRow(
+      input: RDD[CachedBatch],
+      cacheAttributes: Seq[Attribute],
+      selectedAttributes: Seq[Attribute],
+      conf: SQLConf): RDD[InternalRow] = {
+    val cacheSchema = DataTypeUtils.fromAttributes(cacheAttributes)
+    val selectedSchema = DataTypeUtils.fromAttributes(selectedAttributes)
+    val timeZoneId = conf.sessionLocalTimeZone
+
+    // Calculate column indices for projection
+    val selectedIndices = selectedAttributes.map { attr =>
+      cacheAttributes.indexWhere(_.exprId == attr.exprId)
+    }.toArray
+
+    // Check if all selected types can use the fast path.
+    // Types not handled by ArrowColumnReader must use the fallback path.
+    val needsFallback = selectedSchema.fields.exists { f =>
+      f.dataType match {
+        case _: ArrayType | _: StructType | _: MapType => true
+        case CalendarIntervalType | VariantType | NullType => true
+        case _: UserDefinedType[_] => true
+        // Geometry/Geography are represented as an Arrow struct (srid + wkb); the fast-path
+        // ArrowColumnReader does not handle them, so route them through the fallback.
+        case _: GeometryType | _: GeographyType => true
+        case _ => false
+      }
+    }
+
+    if (needsFallback) {
+      // Fall back to columnar-to-row conversion via ColumnarBatch for complex types.
+      // Use UnsafeProjection to convert ColumnarBatchRow to UnsafeRow.
+      convertCachedBatchToColumnarBatch(input, cacheAttributes, selectedAttributes, conf)
+        .mapPartitionsInternal { batchIter =>
+          val toUnsafe = org.apache.spark.sql.catalyst.expressions.UnsafeProjection.create(
+            selectedSchema)
+          batchIter.flatMap { batch =>
+            val numRows = batch.numRows()
+            new Iterator[InternalRow] {
+              private var rowIdx = 0
+              override def hasNext: Boolean = rowIdx < numRows
+              override def next(): InternalRow = {
+                val row = batch.getRow(rowIdx)
+                rowIdx += 1
+                toUnsafe(row)
+              }
+            }
+          }
+        }
+    } else {
+      val prefetchEnabled = conf.arrowCachePrefetchEnabled
+      input.mapPartitionsInternal { batchIterator =>
+        new ArrowCachedBatchToInternalRowIterator(
+          batchIterator,
+          cacheSchema,
+          selectedSchema,
+          selectedIndices,
+          timeZoneId,
+          prefetchEnabled)
+      }
+    }
+  }
+}
+
+/**
+ * Companion object with shared utility methods for Arrow cache serialization.
+ */
+private object ArrowCachedBatchSerializer {
+
+  // scalastyle:off caselocale
+  def createCompressionCodec(
+      codecName: String,
+      compressionLevel: Int): CompressionCodec = {
+    codecName.toLowerCase match {
+      case "none" => NoCompressionCodec.INSTANCE
+      // The codec instance must be constructed directly so that compressionLevel is honored:
+      // CompressionCodec.Factory.createCodec(codecType) ignores the level and builds a codec at
+      // the default level. The level only matters on the write side; the read side looks up the
+      // codec by the type recorded in the IPC message.
+      case "zstd" => new ZstdCompressionCodec(compressionLevel)
+      case "lz4" => new Lz4CompressionCodec()
+      case other =>
+        throw SparkException.internalError(
+          s"Unsupported Arrow compression codec: $other. Supported values: none, zstd, lz4")
+    }
+  }
+  // scalastyle:on caselocale
+
+  def serializeBatch(batch: ArrowRecordBatch): Array[Byte] = {
+    val out = new ByteArrayOutputStream()
+    val writeChannel = new WriteChannel(Channels.newChannel(out))
+    MessageSerializer.serialize(writeChannel, batch)
+    out.toByteArray
+  }
+
+  /**
+   * Shut down a prefetch worker during task cleanup without leaking the root it may have produced.
+   *
+   * The prefetch worker deserializes the next batch into a fresh [[VectorSchemaRoot]] off-thread.
+   * If task completion runs while a result is in flight (e.g. a LIMIT consumer stops early),
+   * cancelling and discarding the future would drop a root that was already (or is about to be)
+   * produced, and the subsequent `allocator.close()` would fail with "Memory was leaked by query".
+   *
+   * This stops accepting new work, waits for the worker to finish so no root is produced after we
+   * stop looking, then closes any completed result. Always returns null so the caller can null out
+   * its future reference. Safe to call with a null executor or future.
+   */
+  def drainAndClosePrefetch(
+      executor: java.util.concurrent.ExecutorService,
+      future: java.util.concurrent.Future[VectorSchemaRoot]): java.util.concurrent.Future[
+        VectorSchemaRoot] = {
+    if (executor != null) {
+      // Stop accepting new tasks and wait for the in-flight deserialization to finish, rather than
+      // interrupting it: an interrupt mid-allocation can race allocator shutdown and still leak.
+      executor.shutdown()
+      try {
+        executor.awaitTermination(Long.MaxValue, java.util.concurrent.TimeUnit.NANOSECONDS)
+      } catch {
+        case _: InterruptedException =>
+          Thread.currentThread().interrupt()
+          executor.shutdownNow()
+      }
+    }
+    if (future != null) {
+      try {
+        // The worker has terminated, so this does not block; close the root it produced.
+        val root = future.get()
+        if (root != null) {
+          root.close()
+        }
+      } catch {
+        // The batch was never produced (cancelled/failed); nothing to close.
+        case _: java.util.concurrent.CancellationException =>
+        case _: java.util.concurrent.ExecutionException =>
+        case _: InterruptedException => Thread.currentThread().interrupt()
+      }
+    }
+    null
+  }
+
+  def createColumnStats(dataType: DataType): ColumnStats = {
+    dataType match {
+      case BooleanType => new BooleanColumnStats
+      case ByteType => new ByteColumnStats
+      case ShortType => new ShortColumnStats
+      case IntegerType => new IntColumnStats
+      case DateType => new IntColumnStats  // Date is stored as Int
+      case LongType => new LongColumnStats
+      case TimestampType => new LongColumnStats  // Timestamp is stored as Long
+      case TimestampNTZType => new LongColumnStats  // TimestampNTZ is stored as Long
+      case FloatType => new FloatColumnStats
+      case DoubleType => new DoubleColumnStats
+      case st: StringType => new StringColumnStats(st)
+      case BinaryType => new BinaryColumnStats
+      case dt: DecimalType => new DecimalColumnStats(dt)
+      case CalendarIntervalType => new IntervalColumnStats
+      case _: YearMonthIntervalType => new IntColumnStats   // stored as Int
+      case _: DayTimeIntervalType => new LongColumnStats  // stored as Long
+      case _: TimeType => new LongColumnStats  // Time is stored as Long (nanoseconds)
+      case VariantType => new VariantColumnStats
+      // Geometry/Geography are stored as binary (WKB) internally, so reuse BinaryColumnStats
+      // to collect size/count without min/max bounds. They are AtomicTypes that ColumnType
+      // (used by ObjectColumnStats) does not handle, so they must be matched explicitly here.
+      case _: GeometryType | _: GeographyType => new BinaryColumnStats

Review Comment:
   Fixed. Added a BinaryView-aware `GeoColumnStats` and routed 
Geometry/Geography to it in `createColumnStats`. It reads the value via 
`getBinaryView` (matching how `ArrowWriter` consumes it) instead of 
`row.getBinary`, so a `GenericInternalRow` storing a `BinaryView` no longer 
throws `ClassCastException`. Added a test that drives the collector with a 
`BinaryView` in a generic row.



##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
 
   // todo: support more types.
 
+  /**
+   * Check if a Spark DataType is supported by Arrow. This recursively checks complex types
+   * (Array, Struct, Map).
+   *
+   * Note: This checks compatibility with toArrowField(), not toArrowType(). Types like
+   * GeometryType, GeographyType, and VariantType are not supported by toArrowType() (which only
+   * handles primitive Arrow types), but ARE supported by toArrowField() which converts them to
+   * Arrow Struct representations with metadata. Since Arrow cache uses toArrowField() via
+   * toArrowSchema() to create the schema, these types are supported.
+   */
+  def isSupportedByArrow(dt: DataType): Boolean = {
+    dt match {
+      // Primitive types
+      case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
+          _: StringType | BinaryType | NullType =>
+        true
+
+      // Decimal
+      case _: DecimalType => true
+
+      // Temporal types
+      case DateType | TimestampType | TimestampNTZType | _: TimeType => true

Review Comment:
   Good catch that master now maps these through Arrow. I looked at the cache 
paths, though, and the physical value for 
`TimestampNTZNanosType`/`TimestampLTZNanosType` is a `TimestampNanosVal`, not a 
plain `Long`, so the cache's stats collector and fast columnar reader can't 
treat them as long-backed without a dedicated, precision-aware path -- that's 
net-new support rather than a predicate tweak. Importantly there's no parity 
regression: the default cache serializer doesn't support these types either. I 
verified on current master that `df.cache()` of a `TIMESTAMP_NTZ(9)` column 
throws `not support type: TimestampNTZNanosType(9)` with the default serializer 
(its `ColumnBuilder` has no case for them). The Arrow serializer now rejects 
them with a clear `checkSupportedSchema` error at materialization. I'd rather 
add real support as a focused follow-up than land a half-correct fast path here 
-- let me know if you'd prefer it gated differently in the meantime.



##########
docs/sql-arrow-cache-format.md:
##########
@@ -0,0 +1,343 @@
+# Apache Arrow Cache Format for Spark
+
+## Overview
+
+Apache Spark supports using Apache Arrow as an alternative cache format for in-memory Dataset caching. This format provides improved performance for certain workloads, especially when working with columnar data sources like Parquet and ORC.
+
+## Benefits
+
+The Arrow cache format offers several advantages over the default cache format:
+
+- **Zero-copy reads** when input is already in Arrow format (e.g., Arrow-based data sources, re-caching Arrow cached data)
+- **Better filter pushdown** with min/max statistics for partition pruning
+- **Off-heap memory management** via Arrow allocators
+- **Efficient compression** with zstd and lz4 codecs
+- **Arrow ecosystem interoperability** for data sharing
+
+**Note**: Spark's built-in Parquet/ORC readers use internal column vectors (`OnHeapColumnVector`/`OffHeapColumnVector`), not Arrow format, so they don't benefit from zero-copy optimization.
+
+## Configuration
+
+`spark.sql.cache.serializer` is a static SQL configuration, so it must be set when the
+SparkSession is built and cannot be changed on a running session (`spark.conf.set` rejects static
+keys with `CANNOT_MODIFY_CONFIG`):
+
+```scala
+val spark = SparkSession.builder()
+  .appName("MyApp")
+  .config("spark.sql.cache.serializer",
+    "org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer")
+  .getOrCreate()
+```
+
+**Note**: This config selects the cache serializer for the whole session; once set, this
+serializer handles every cached relation. There is no automatic per-relation fallback to another
+cache serializer based on the data types involved (see
+[Supported Data Types](#supported-data-types) for how unsupported types are handled). The chosen
+serializer is also cached process-wide on first use, so switching cache formats within a JVM that
+has already materialized a cache requires a fresh JVM (see
+[Migration from Default Cache](#migration-from-default-cache)).
+
+## Usage
+
+Once configured, use cache operations as normal:
+
+```scala
+// Cache a DataFrame
+val df = spark.read.parquet("data.parquet")
+df.cache()
+
+// Use cached data
+df.filter("age > 30").count()
+
+// Uncache when done
+df.unpersist()
+```
+
+## Compression
+
+Arrow cache supports multiple compression codecs. Configure compression with:
+
+```scala
+spark.conf.set("spark.sql.execution.arrow.compression.codec", "zstd")
+```
+
+Available options:
+- `none` - No compression (fastest, largest size, **default**)
+- `lz4` - LZ4 compression (fast, good compression)
+- `zstd` - Zstandard compression (slower, best compression)
+
+For zstd, you can also configure the compression level. Positive values (up to 22) give better
+compression but slower speed; negative values give ultra-fast compression with lower ratios:
+
+```scala
+spark.conf.set("spark.sql.execution.arrow.compression.zstd.level", "3")  // Default: 3
+```
+
+## Vectorized Reader
+
+Enable vectorized reading for better performance with primitive types:
+
+```scala
+spark.conf.set("spark.sql.inMemoryColumnarStorage.enableVectorizedReader", "true")
+```
+
+When enabled, cached data is read as columnar batches instead of rows, which can significantly improve performance for columnar operations.
+
+## Performance Characteristics
+
+In our benchmarks, the Arrow cache format performs best on the following workloads. Actual
+results depend on data types, compression settings, and hardware, and the default cache format
+can be faster in some cases (for example, with higher compression levels):
+
+1. **Filter-Heavy Workloads**: Queries with selective filters benefit from min/max statistics.
+2. **Columnar Operations**: Aggregations and projections on cached data benefit from the Arrow format.
+3. **Parquet/ORC Caching**: Arrow's batch processing helps even without the zero-copy path.
+4. **Re-caching with Column Projection**: Dropping columns from Arrow-cached data preserves the
+   `ArrowColumnVector` format, enabling true zero-copy extraction and the largest gains.
+
+### Benchmark Results
+
+The numbers below are illustrative results from one run on an Apple M4 Max (OpenJDK 21.0.8) and
+will vary with hardware, JDK, and compression settings. They are not a guarantee. For the
+authoritative, regularly regenerated numbers, see
+`sql/core/benchmarks/ArrowCacheBenchmark-jdk21-results.txt` and the `ArrowCacheBenchmark` suite.
+
+| Workload | Default Cache | Arrow Cache | Speedup |
+|----------|--------------|-------------|---------|
+| Write + Read (5M rows, 3 primitive columns) | 153.7 ns/row | 74.2 ns/row | **~2X faster** |
+| Cache then filter (5M rows) | 100.1 ns/row | 70.8 ns/row | **~1.4X faster** |
+| Columnar input from Parquet (2M rows, 3 primitive columns) | 195.3 ns/row | 113.1 ns/row | **~1.7X faster** |
+| Re-cache with zero-copy (2M rows, 2 columns) | 123.3 ns/row | 38.5 ns/row | **~3.2X faster** |
+
+**Notes**:
+- **Write + Read**: Significant improvement from efficient Arrow serialization and vectorized operations
+- **Cache then filter**: This measures end-to-end cache build plus a filtered scan, comparing the two cache formats. Both formats collect min/max statistics and can prune batches, so the difference reflects overall cache+scan throughput rather than pruning unique to Arrow
+- **Parquet caching**: Shows improvement despite Spark's Parquet reader producing `OnHeapColumnVector`/`OffHeapColumnVector` rather than `ArrowColumnVector`, due to Arrow's efficient batch processing
+- **Re-cache with zero-copy**: When caching a subset of columns from Arrow-cached data (e.g., `df.drop("column")`), the remaining columns preserve their `ArrowColumnVector` format, enabling true zero-copy extraction and achieving the best performance
+- **Zero-copy benefits** only apply when input is already `ArrowColumnVector` (e.g., Python Arrow sources, re-caching Arrow cached data with column projection)
+
+## Supported Data Types
+
+Arrow cache supports the following data types:
+
+### Primitive Types
+- BooleanType
+- ByteType, ShortType, IntegerType, LongType
+- FloatType, DoubleType
+- DecimalType (all precision/scale combinations)
+- NullType
+
+### Temporal Types
+- DateType
+- TimestampType
+- TimestampNTZType
+- TimeType
+
+### Interval Types
+- YearMonthIntervalType
+- DayTimeIntervalType
+- CalendarIntervalType
+
+### String and Binary
+- StringType (including collated strings)
+- BinaryType
+
+### Complex Types
+- ArrayType
+- StructType
+- MapType
+- Nested combinations of the above
+
+### Other Types
+- VariantType
+- GeometryType, GeographyType
+- User-defined types (UDTs) whose underlying representation is itself supported
+
+### Unsupported Types
+
+Arrow cache covers every type the default cache serializer supports, plus some it
+does not (for example geometry and geography). Types that Arrow cannot represent
+(such as `ObjectType`) are not silently dropped or routed to a different cache
+serializer: there is no per-type fallback, because the cache serializer is chosen
+once via the static `spark.sql.cache.serializer` configuration and then handles
+every cached relation. Attempting to cache an unsupported type fails with an
+`UNSUPPORTED_DATATYPE` error when the cache is materialized.
+
+## Statistics and Filter Pushdown
+
+Arrow cache automatically collects min/max statistics for the following types:
+- Boolean
+- Numeric types (Byte, Short, Int, Long, Float, Double)
+- Decimal
+- Date, Timestamp, and Timestamp without time zone (TIMESTAMP_NTZ)
+- Time
+- Year-month and day-time intervals
+- String (using collation-aware comparison for collated strings)
+
+Other types (Binary, Variant, calendar intervals, and complex types such as
+Array/Struct/Map) are cached but do not contribute min/max bounds, so they only
+record null counts and sizes.
+
+These statistics enable partition pruning when filtering:
+
+```scala
+val df = spark.range(10000000).cache()
+
+// This filter can skip batches using min/max statistics
+df.filter("id > 5000000").count()
+```
+
+## Memory Management
+
+Arrow cache uses off-heap memory managed by Apache Arrow allocators. This is a fundamental design choice in Apache Arrow and is not configurable for on-heap memory.

Review Comment:
   Fixed. Rewrote the Memory Management section: the durable cached payload is 
a heap `Array[Byte]` (the default `Dataset.cache()` level is the deserialized 
`MEMORY_AND_DISK`), and Arrow's off-heap allocators back only the transient 
encode/decode roots. The sizing guidance now points at executor heap for cache 
capacity and clarifies that `spark.executor.memoryOverhead` covers only the 
per-batch transient buffers, not the total cache size.
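
   As a rough sketch of that sizing guidance, a session could be configured along the lines below. The application name and the memory values are placeholders chosen for illustration, not recommendations from this PR. `spark.executor.memory` and `spark.executor.memoryOverhead` are standard Spark settings and are often supplied at submit time instead of in code.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: size the executor heap for the cached payload,
// and keep the overhead small, since it only backs transient Arrow buffers.
val spark = SparkSession.builder()
  .appName("ArrowCacheSizingExample") // hypothetical application name
  .config("spark.sql.cache.serializer",
    "org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer")
  // Heap: holds the durable cached batches (Array[Byte]).
  .config("spark.executor.memory", "8g")
  // Off-heap headroom: per-batch encode/decode roots, not the total cache size.
  .config("spark.executor.memoryOverhead", "1g")
  .getOrCreate()
```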



##########
docs/sql-arrow-cache-format.md:
##########
@@ -0,0 +1,343 @@
+# Apache Arrow Cache Format for Spark
+
+## Overview
+
+Apache Spark supports using Apache Arrow as an alternative cache format for 
in-memory Dataset caching. This format provides improved performance for 
certain workloads, especially when working with columnar data sources like 
Parquet and ORC.
+
+## Benefits
+
+The Arrow cache format offers several advantages over the default cache format:
+
+- **Zero-copy reads** when input is already in Arrow format (e.g., Arrow-based 
data sources, re-caching Arrow cached data)
+- **Better filter pushdown** with min/max statistics for partition pruning
+- **Off-heap memory management** via Arrow allocators
+- **Efficient compression** with zstd and lz4 codecs
+- **Arrow ecosystem interoperability** for data sharing
+
+**Note**: Spark's built-in Parquet/ORC readers use internal column vectors 
(`OnHeapColumnVector`/`OffHeapColumnVector`), not Arrow format, so they don't 
benefit from zero-copy optimization.
+
+## Configuration
+
+`spark.sql.cache.serializer` is a static SQL configuration, so it must be set 
when the
+SparkSession is built and cannot be changed on a running session 
(`spark.conf.set` rejects static
+keys with `CANNOT_MODIFY_CONFIG`):
+
+```scala
+val spark = SparkSession.builder()
+  .appName("MyApp")
+  .config("spark.sql.cache.serializer",
+    "org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer")
+  .getOrCreate()
+```
+
+**Note**: This config selects the cache serializer for the whole session; once 
set, this
+serializer handles every cached relation. There is no automatic per-relation 
fallback to another
+cache serializer based on the data types involved (see
+[Supported Data Types](#supported-data-types) for how unsupported types are 
handled). The chosen
+serializer is also cached process-wide on first use, so switching cache 
formats within a JVM that
+has already materialized a cache requires a fresh JVM (see
+[Migration from Default Cache](#migration-from-default-cache)).
+
+## Usage
+
+Once configured, use cache operations as normal:
+
+```scala
+// Cache a DataFrame
+val df = spark.read.parquet("data.parquet")
+df.cache()
+
+// Use cached data
+df.filter("age > 30").count()
+
+// Uncache when done
+df.unpersist()
+```
+
+## Compression
+
+Arrow cache supports multiple compression codecs. Configure compression with:
+
+```scala
+spark.conf.set("spark.sql.execution.arrow.compression.codec", "zstd")
+```
+
+Available options:
+- `none` - No compression (fastest, largest size, **default**)
+- `lz4` - LZ4 compression (fast, good compression)
+- `zstd` - Zstandard compression (slower, best compression)
+
+For zstd, you can also configure the compression level. Positive values (up to 
22) give better
+compression but slower speed; negative values give ultra-fast compression with 
lower ratios:
+
+```scala
+spark.conf.set("spark.sql.execution.arrow.compression.zstd.level", "3")  // Default: 3
+```
+
+## Vectorized Reader
+
+Enable vectorized reading for better performance with primitive types:
+
+```scala
+spark.conf.set("spark.sql.inMemoryColumnarStorage.enableVectorizedReader", 
"true")
+```
+
+When enabled, cached data is read as columnar batches instead of rows, which 
can significantly improve performance for columnar operations.
+
+## Performance Characteristics
+
+In our benchmarks, the Arrow cache format performs best on the following 
workloads. Actual
+results depend on data types, compression settings, and hardware, and the 
default cache format
+can be faster in some cases (for example, with higher compression levels):
+
+1. **Filter-Heavy Workloads**: Queries with selective filters benefit from 
min/max statistics.
+2. **Columnar Operations**: Aggregations and projections on cached data 
benefit from the Arrow format.
+3. **Parquet/ORC Caching**: Arrow's batch processing helps even without the 
zero-copy path.
+4. **Re-caching with Column Projection**: Dropping columns from Arrow-cached 
data preserves the
+   `ArrowColumnVector` format, enabling true zero-copy extraction and the 
largest gains.
+
+### Benchmark Results
+
+The numbers below are illustrative results from one run on an Apple M4 Max (OpenJDK 21.0.8) and
+will vary with hardware, JDK, and compression settings. They are not a guarantee. For the
+authoritative, regularly regenerated numbers, see
+`sql/core/benchmarks/ArrowCacheBenchmark-jdk21-results.txt` and the `ArrowCacheBenchmark` suite.
+
+| Workload | Default Cache | Arrow Cache | Speedup |
+|----------|--------------|-------------|---------|
+| Write + Read (5M rows, 3 primitive columns) | 153.7 ns/row | 74.2 ns/row | **~2X faster** |
+| Cache then filter (5M rows) | 100.1 ns/row | 70.8 ns/row | **~1.4X faster** |
+| Columnar input from Parquet (2M rows, 3 primitive columns) | 195.3 ns/row | 113.1 ns/row | **~1.7X faster** |
+| Re-cache with zero-copy (2M rows, 2 columns) | 123.3 ns/row | 38.5 ns/row | **~3.2X faster** |
+
+**Notes**:
+- **Write + Read**: Significant improvement from efficient Arrow serialization and vectorized operations
+- **Cache then filter**: This measures end-to-end cache build plus a filtered scan, comparing the two cache formats. Both formats collect min/max statistics and can prune batches, so the difference reflects overall cache+scan throughput rather than pruning unique to Arrow
+- **Parquet caching**: Shows improvement despite Spark's Parquet reader producing `OnHeapColumnVector`/`OffHeapColumnVector` rather than `ArrowColumnVector`, due to Arrow's efficient batch processing
+- **Re-cache with zero-copy**: When caching a subset of columns from Arrow-cached data (e.g., `df.drop("column")`), the remaining columns preserve their `ArrowColumnVector` format, enabling true zero-copy extraction and achieving the best performance
+- **Zero-copy benefits** only apply when input is already `ArrowColumnVector` (e.g., Python Arrow sources, re-caching Arrow cached data with column projection)
+
+## Supported Data Types
+
+Arrow cache supports the following data types:
+
+### Primitive Types
+- BooleanType
+- ByteType, ShortType, IntegerType, LongType
+- FloatType, DoubleType
+- DecimalType (all precision/scale combinations)
+- NullType
+
+### Temporal Types
+- DateType
+- TimestampType
+- TimestampNTZType
+- TimeType
+
+### Interval Types
+- YearMonthIntervalType
+- DayTimeIntervalType
+- CalendarIntervalType
+
+### String and Binary
+- StringType (including collated strings)
+- BinaryType
+
+### Complex Types
+- ArrayType
+- StructType
+- MapType
+- Nested combinations of the above
+
+### Other Types
+- VariantType
+- GeometryType, GeographyType
+- User-defined types (UDTs) whose underlying representation is itself supported
+
+### Unsupported Types
+
+Arrow cache covers every type the default cache serializer supports, plus some it
+does not (for example geometry and geography). Types that Arrow cannot represent
+(such as `ObjectType`) are not silently dropped or routed to a different cache
+serializer: there is no per-type fallback, because the cache serializer is chosen
+once via the static `spark.sql.cache.serializer` configuration and then handles
+every cached relation. Attempting to cache an unsupported type fails with an
+`UNSUPPORTED_DATATYPE` error when the cache is materialized.
+
+## Statistics and Filter Pushdown
+
+Arrow cache automatically collects min/max statistics for the following types:
+- Boolean
+- Numeric types (Byte, Short, Int, Long, Float, Double)
+- Decimal
+- Date, Timestamp, and Timestamp without time zone (TIMESTAMP_NTZ)
+- Time
+- Year-month and day-time intervals
+- String (using collation-aware comparison for collated strings)
+
+Other types (Binary, Variant, calendar intervals, and complex types such as
+Array/Struct/Map) are cached but do not contribute min/max bounds, so they only
+record null counts and sizes.
+
+These statistics enable partition pruning when filtering:
+
+```scala
+val df = spark.range(10000000).cache()
+
+// This filter can skip batches using min/max statistics
+df.filter("id > 5000000").count()
+```
+
+## Memory Management
+
+Arrow cache uses off-heap memory managed by Apache Arrow allocators. This is a fundamental design choice in Apache Arrow; the cache cannot be configured to use on-heap memory.
+
+**Memory Efficiency**:
+- Despite requiring off-heap memory, Arrow cache is often **more memory-efficient** than default cache:
+  - Efficient compression with zstd/lz4 codecs
+  - Compact columnar format without Java object overhead
+  - Better compression ratios, especially for strings and complex types
+- If you have limited off-heap memory, increase `spark.executor.memoryOverhead` to allocate more off-heap memory
+
+**Memory Cleanup**:
+Arrow memory is automatically cleaned up when:
+- Tasks complete
+- DataFrames are unpersisted
+- SparkSession is stopped
+
+You can monitor Arrow memory usage through Spark metrics and the Spark UI.
+
+## Limitations and Considerations
+
+1. **Static Configuration**: The cache serializer must be set before the SparkSession is created
+2. **Memory Overhead**: The Arrow format has a small per-batch overhead
+3. **Compatibility**: Cache formats cannot be mixed; switching formats requires re-caching
+4. **Compression Trade-off**: Higher compression means lower memory use but slower reads
+
+## Migration from Default Cache
+
+The cache serializer is resolved from `spark.sql.cache.serializer` only on first use and is then
+held in a process-wide field that is not reset when a SparkSession stops. As a result, **switching
+cache formats requires a fresh JVM** once any cache has been materialized -- stopping and
+rebuilding the SparkSession in the same process keeps using the originally resolved serializer.
+
+To migrate from the default cache to Arrow cache:
+
+1. **Start a new JVM / driver process** (a brand-new Spark application).
+2. **Build the SparkSession with the Arrow serializer**:
+   ```scala
+   val spark = SparkSession.builder()
+     .config("spark.sql.cache.serializer",
+       "org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer")
+     .getOrCreate()
+   ```
+3. **Cache your DataFrames** as usual.
+
+**Note**: Cache data is never shared across formats; each application caches in whichever format
+its serializer produces.
+
+## Troubleshooting
+
+### Out of Memory Errors
+
+If you encounter OOM errors with Arrow cache:
+
+1. Reduce batch size:
+   ```scala
+   spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")  // Default: 10000
+   ```
+
+2. Enable compression:
+   ```scala
+   spark.conf.set("spark.sql.execution.arrow.compression.codec", "zstd")
+   ```
+
+3. Reduce compression level:
+   ```scala
+   spark.conf.set("spark.sql.execution.arrow.compression.zstd.level", "1")
+   ```
+
+### Slow Performance
+
+If Arrow cache is slower than expected:
+
+1. Enable vectorized reader:
+   ```scala
+   spark.conf.set("spark.sql.inMemoryColumnarStorage.enableVectorizedReader", "true")
+   ```
+
+2. Try a different compression codec:
+   ```scala
+   spark.conf.set("spark.sql.execution.arrow.compression.codec", "lz4")  // Faster than zstd

Review Comment:
   Fixed. The troubleshooting section no longer recommends lz4 as 
faster/balanced. It now suggests lowering the zstd level or using `none` for 
read-heavy workloads, with an explicit note that lz4 should be avoided unless 
the native LZ4 library is on the classpath, since otherwise Arrow falls back to 
the much slower pure-Java Commons Compress implementation.
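
   A minimal sketch of that guidance, for illustration only. The object
   `ReadHeavyCompressionConf` and its `settings` method are hypothetical names
   that are not part of this PR; only the two configuration keys come from the
   documentation above. It deliberately never returns `lz4`.

   ```scala
   // Hypothetical helper (not in the PR): picks compression settings for a
   // read-heavy workload using only the config keys documented above.
   object ReadHeavyCompressionConf {
     val CodecKey = "spark.sql.execution.arrow.compression.codec"
     val ZstdLevelKey = "spark.sql.execution.arrow.compression.zstd.level"

     // keepCompression = false -> codec "none" (no decompression on reads).
     // keepCompression = true  -> zstd at a low level (assumed default: 1).
     def settings(keepCompression: Boolean, zstdLevel: Int = 1): Map[String, String] = {
       if (keepCompression) {
         Map(CodecKey -> "zstd", ZstdLevelKey -> zstdLevel.toString)
       } else {
         Map(CodecKey -> "none")
       }
     }
   }
   ```

   Assuming an existing session named `spark`, the entries could be applied with
   `ReadHeavyCompressionConf.settings(keepCompression = false).foreach { case (k, v) => spark.conf.set(k, v) }`.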



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ArrowCachedBatchSerializer.scala:
##########
@@ -0,0 +1,1371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.columnar
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
+import java.nio.channels.Channels
+
+import scala.jdk.CollectionConverters._
+
+import org.apache.arrow.compression.{Lz4CompressionCodec, ZstdCompressionCodec}
+import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot, VectorUnloader}
+import org.apache.arrow.vector.compression.{CompressionCodec, NoCompressionCodec}
+import org.apache.arrow.vector.ipc.{ReadChannel, WriteChannel}
+import org.apache.arrow.vector.ipc.message.{ArrowRecordBatch, MessageSerializer}
+
+import org.apache.spark.{SparkException, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
+import org.apache.spark.sql.catalyst.types.DataTypeUtils
+import org.apache.spark.sql.columnar.{CachedBatch, SimpleMetricsCachedBatchSerializer}
+import org.apache.spark.sql.execution.arrow.ArrowWriter
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.util.ArrowUtils
+import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.util.Utils
+
+/**
+ * A [[CachedBatchSerializer]] that uses Apache Arrow as the cache format.
+ *
+ * This serializer:
+ *  - Supports both row-based (InternalRow) and columnar (ColumnarBatch) input
+ *  - Stores data in Arrow IPC streaming format with optional compression (zstd/lz4)
+ *  - Enables zero-copy columnar reads when output is ColumnarBatch
+ *  - Uses off-heap memory via Arrow allocators
+ *  - Collects per-column statistics for partition pruning
+ *  - Provides efficient interoperability with Arrow ecosystem
+ *
+ * Configuration options:
+ *  - spark.sql.cache.serializer: Set to this class name to enable
+ *  - spark.sql.execution.arrow.maxRecordsPerBatch: Max rows per cached batch
+ *  - spark.sql.execution.arrow.compression.codec: Compression (none/zstd/lz4)
+ *  - spark.sql.inMemoryColumnarStorage.enableVectorizedReader: Enable columnar output
+ */
+class ArrowCachedBatchSerializer extends SimpleMetricsCachedBatchSerializer {
+
+  override def supportsColumnarInput(schema: Seq[Attribute]): Boolean = {
+    // Check if all data types in the schema are supported by Arrow
+    schema.forall(attr => ArrowUtils.isSupportedByArrow(attr.dataType))
+  }
+
+  override def convertInternalRowToCachedBatch(
+      input: RDD[InternalRow],
+      schema: Seq[Attribute],
+      storageLevel: StorageLevel,
+      conf: SQLConf): RDD[CachedBatch] = {
+    // Capture config values on driver before RDD transformation
+    val sparkSchema = DataTypeUtils.fromAttributes(schema)
+    val maxRecordsPerBatch = conf.arrowMaxRecordsPerBatch
+    val timeZoneId = conf.sessionLocalTimeZone
+    val compressionCodecName = conf.arrowCompressionCodec
+    val compressionLevel = conf.arrowZstdCompressionLevel
+
+    input.mapPartitionsInternal { rowIterator =>
+      new InternalRowToArrowCachedBatchIterator(
+        rowIterator,
+        schema,
+        sparkSchema,
+        maxRecordsPerBatch,
+        timeZoneId,
+        compressionCodecName,
+        compressionLevel)
+    }
+  }
+
+  override def convertColumnarBatchToCachedBatch(
+      input: RDD[ColumnarBatch],
+      schema: Seq[Attribute],
+      storageLevel: StorageLevel,
+      conf: SQLConf): RDD[CachedBatch] = {
+    // Capture config values on driver before RDD transformation
+    val sparkSchema = DataTypeUtils.fromAttributes(schema)
+    val timeZoneId = conf.sessionLocalTimeZone
+    val compressionCodecName = conf.arrowCompressionCodec
+    val compressionLevel = conf.arrowZstdCompressionLevel
+
+    input.mapPartitionsInternal { batchIterator =>
+      new ColumnarBatchToArrowCachedBatchIterator(
+        batchIterator,
+        schema,
+        sparkSchema,
+        timeZoneId,
+        compressionCodecName,
+        compressionLevel)
+    }
+  }
+
+  override def supportsColumnarOutput(schema: StructType): Boolean = {
+    // Always support columnar output with Arrow
+    true
+  }
+
+  override def vectorTypes(attributes: Seq[Attribute], conf: SQLConf): Option[Seq[String]] = {
+    Option(Seq.fill(attributes.length)(classOf[ArrowColumnVector].getName))
+  }
+
+  override def convertCachedBatchToColumnarBatch(
+      input: RDD[CachedBatch],
+      cacheAttributes: Seq[Attribute],
+      selectedAttributes: Seq[Attribute],
+      conf: SQLConf): RDD[ColumnarBatch] = {
+    val cacheSchema = DataTypeUtils.fromAttributes(cacheAttributes)
+    val selectedSchema = DataTypeUtils.fromAttributes(selectedAttributes)
+    val columnIndices =
+      selectedAttributes.map(a => cacheAttributes.map(o => o.exprId).indexOf(a.exprId)).toArray
+    // Capture config on driver
+    val timeZoneId = conf.sessionLocalTimeZone
+    val prefetchEnabled = conf.arrowCachePrefetchEnabled
+
+    input.mapPartitionsInternal { batchIterator =>
+      new ArrowCachedBatchToColumnarBatchIterator(
+        batchIterator,
+        cacheSchema,
+        selectedSchema,
+        columnIndices,
+        timeZoneId,
+        prefetchEnabled)
+    }
+  }
+
+  override def convertCachedBatchToInternalRow(
+      input: RDD[CachedBatch],
+      cacheAttributes: Seq[Attribute],
+      selectedAttributes: Seq[Attribute],
+      conf: SQLConf): RDD[InternalRow] = {
+    val cacheSchema = DataTypeUtils.fromAttributes(cacheAttributes)
+    val selectedSchema = DataTypeUtils.fromAttributes(selectedAttributes)
+    val timeZoneId = conf.sessionLocalTimeZone
+
+    // Calculate column indices for projection
+    val selectedIndices = selectedAttributes.map { attr =>
+      cacheAttributes.indexWhere(_.exprId == attr.exprId)
+    }.toArray
+
+    // Check if all selected types can use the fast path.
+    // Types not handled by ArrowColumnReader must use the fallback path.
+    val needsFallback = selectedSchema.fields.exists { f =>
+      f.dataType match {
+        case _: ArrayType | _: StructType | _: MapType => true
+        case CalendarIntervalType | VariantType | NullType => true
+        case _: UserDefinedType[_] => true
+        // Geometry/Geography are represented as an Arrow struct (srid + wkb); the fast-path
+        // ArrowColumnReader does not handle them, so route them through the fallback.
+        case _: GeometryType | _: GeographyType => true
+        case _ => false
+      }
+    }
+
+    if (needsFallback) {
+      // Fall back to columnar-to-row conversion via ColumnarBatch for complex types.
+      // Use UnsafeProjection to convert ColumnarBatchRow to UnsafeRow.
+      convertCachedBatchToColumnarBatch(input, cacheAttributes, selectedAttributes, conf)
+        .mapPartitionsInternal { batchIter =>
+          val toUnsafe = org.apache.spark.sql.catalyst.expressions.UnsafeProjection.create(
+            selectedSchema)
+          batchIter.flatMap { batch =>
+            val numRows = batch.numRows()
+            new Iterator[InternalRow] {
+              private var rowIdx = 0
+              override def hasNext: Boolean = rowIdx < numRows
+              override def next(): InternalRow = {
+                val row = batch.getRow(rowIdx)
+                rowIdx += 1
+                toUnsafe(row)
+              }
+            }
+          }
+        }
+    } else {
+      val prefetchEnabled = conf.arrowCachePrefetchEnabled
+      input.mapPartitionsInternal { batchIterator =>
+        new ArrowCachedBatchToInternalRowIterator(
+          batchIterator,
+          cacheSchema,
+          selectedSchema,
+          selectedIndices,
+          timeZoneId,
+          prefetchEnabled)
+      }
+    }
+  }
+}
+
+/**
+ * Companion object with shared utility methods for Arrow cache serialization.
+ */
+private object ArrowCachedBatchSerializer {
+
+  // scalastyle:off caselocale
+  def createCompressionCodec(
+      codecName: String,
+      compressionLevel: Int): CompressionCodec = {
+    codecName.toLowerCase match {
+      case "none" => NoCompressionCodec.INSTANCE
+      // The codec instance must be constructed directly so that compressionLevel is honored:
+      // CompressionCodec.Factory.createCodec(codecType) ignores the level and builds a codec at
+      // the default level. The level only matters on the write side; the read side looks up the
+      // codec by the type recorded in the IPC message.
+      case "zstd" => new ZstdCompressionCodec(compressionLevel)
+      case "lz4" => new Lz4CompressionCodec()
+      case other =>
+        throw SparkException.internalError(
+          s"Unsupported Arrow compression codec: $other. Supported values: none, zstd, lz4")
+    }
+  }
+  // scalastyle:on caselocale
+
+  def serializeBatch(batch: ArrowRecordBatch): Array[Byte] = {
+    val out = new ByteArrayOutputStream()
+    val writeChannel = new WriteChannel(Channels.newChannel(out))
+    MessageSerializer.serialize(writeChannel, batch)
+    out.toByteArray
+  }
+
+  def createColumnStats(dataType: DataType): ColumnStats = {
+    dataType match {
+      case BooleanType => new BooleanColumnStats
+      case ByteType => new ByteColumnStats
+      case ShortType => new ShortColumnStats
+      case IntegerType => new IntColumnStats
+      case DateType => new IntColumnStats  // Date is stored as Int
+      case LongType => new LongColumnStats
+      case TimestampType => new LongColumnStats  // Timestamp is stored as Long
+      case TimestampNTZType => new LongColumnStats  // TimestampNTZ is stored as Long
+      case FloatType => new FloatColumnStats
+      case DoubleType => new DoubleColumnStats
+      case st: StringType => new StringColumnStats(st)
+      case BinaryType => new BinaryColumnStats
+      case dt: DecimalType => new DecimalColumnStats(dt)
+      case CalendarIntervalType => new IntervalColumnStats
+      case _: YearMonthIntervalType => new IntColumnStats   // stored as Int
+      case _: DayTimeIntervalType => new LongColumnStats  // stored as Long
+      case _: TimeType => new LongColumnStats  // Time is stored as Long (nanoseconds)
+      case VariantType => new VariantColumnStats
+      // Geometry/Geography are stored as binary (WKB) internally, so reuse BinaryColumnStats
+      // to collect size/count without min/max bounds. They are AtomicTypes that ColumnType
+      // (used by ObjectColumnStats) does not handle, so they must be matched explicitly here.
+      case _: GeometryType | _: GeographyType => new BinaryColumnStats
+      case _ => new ObjectColumnStats(dataType)
+    }
+  }
+
+  def buildStatisticsFromCollectors(
+      collectors: Array[ColumnStats],
+      schema: Seq[Attribute]): InternalRow = {
+    val stats = collectors.flatMap { collector =>
+      val collected = collector.collectedStatistics
+      // ColumnStats returns: [lowerBound, upperBound, nullCount, count, sizeInBytes]
+      Seq(collected(0), collected(1), collected(2), collected(3), collected(4))
+    }
+    InternalRow.fromSeq(stats.toSeq)
+  }
+
+  def collectStatistics(
+      root: VectorSchemaRoot,
+      schema: Seq[Attribute]): InternalRow = {
+    val rowCount = root.getRowCount
+    val vectors = root.getFieldVectors.asScala.toSeq
+
+    // Collect stats for each column: lowerBound, upperBound, nullCount, rowCount, sizeInBytes
+    val stats = schema.zip(vectors).flatMap { case (attr, vector) =>
+      val nullCount = (0 until rowCount).count(i => vector.isNull(i))
+      val sizeInBytes = vector.getBufferSize.toLong
+
+      val (lower, upper) = attr.dataType match {
+        case BooleanType => calculateMinMaxBoolean(vector, rowCount)
+        case ByteType => calculateMinMaxByte(vector, rowCount)
+        case ShortType => calculateMinMaxShort(vector, rowCount)
+        case IntegerType => calculateMinMaxInt(vector, rowCount)
+        case DateType => calculateMinMaxDate(vector, rowCount)
+        case LongType => calculateMinMaxLong(vector, rowCount)
+        case TimestampType => calculateMinMaxTimestamp(vector, rowCount)
+        case TimestampNTZType => calculateMinMaxTimestampNTZ(vector, rowCount)
+        case FloatType => calculateMinMaxFloat(vector, rowCount)
+        case DoubleType => calculateMinMaxDouble(vector, rowCount)
+        case st: StringType => calculateMinMaxString(vector, rowCount, st.collationId)
+        case _: DecimalType => calculateMinMaxDecimal(vector, rowCount, attr.dataType)
+        case _: YearMonthIntervalType => calculateMinMaxYearMonthInterval(vector, rowCount)
+        case _: DayTimeIntervalType => calculateMinMaxDayTimeInterval(vector, rowCount)
+        case _: TimeType => calculateMinMaxTime(vector, rowCount)
+        case _ => (null, null) // Skip for binary, complex, and other unsupported types
+      }
+
+      Seq(lower, upper, nullCount, rowCount, sizeInBytes)
+    }
+
+    new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(stats.toArray)
+  }
+
+  def calculateMinMaxBoolean(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = true
+    var max = false
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = vector.asInstanceOf[org.apache.arrow.vector.BitVector].get(i) != 0
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxByte(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Byte.MaxValue
+    var max = Byte.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = vector.asInstanceOf[org.apache.arrow.vector.TinyIntVector].get(i)
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxShort(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Short.MaxValue
+    var max = Short.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = vector.asInstanceOf[org.apache.arrow.vector.SmallIntVector].get(i)
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxInt(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Int.MaxValue
+    var max = Int.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = vector.asInstanceOf[org.apache.arrow.vector.IntVector].get(i)
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxDate(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Int.MaxValue
+    var max = Int.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = vector.asInstanceOf[org.apache.arrow.vector.DateDayVector].get(i)
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxLong(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Long.MaxValue
+    var max = Long.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = vector.asInstanceOf[org.apache.arrow.vector.BigIntVector].get(i)
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxTimestamp(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Long.MaxValue
+    var max = Long.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value =
+          vector.asInstanceOf[org.apache.arrow.vector.TimeStampMicroTZVector].get(i)
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxTimestampNTZ(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Long.MaxValue
+    var max = Long.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value =
+          vector.asInstanceOf[org.apache.arrow.vector.TimeStampMicroVector].get(i)
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxFloat(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Float.MaxValue
+    var max = Float.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = vector.asInstanceOf[org.apache.arrow.vector.Float4Vector].get(i)
+        // Skip NaN: IEEE 754 comparisons with NaN are always false, so NaN never
+        // updates min/max in the row-based path (FloatColumnStats.gatherValueStats).
+        if (!value.isNaN) {
+          if (!hasValue) {
+            min = value
+            max = value
+            hasValue = true
+          } else {
+            if (value < min) min = value
+            if (value > max) max = value
+          }
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxDouble(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Double.MaxValue
+    var max = Double.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = vector.asInstanceOf[org.apache.arrow.vector.Float8Vector].get(i)
+        // Skip NaN to match DoubleColumnStats.gatherValueStats.
+        if (!value.isNaN) {
+          if (!hasValue) {
+            min = value
+            max = value
+            hasValue = true
+          } else {
+            if (value < min) min = value
+            if (value > max) max = value
+          }
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxString(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int,
+      collationId: Int = StringType.collationId): (Any, Any) = {
+    var min: org.apache.spark.unsafe.types.UTF8String = null
+    var max: org.apache.spark.unsafe.types.UTF8String = null
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val bytes = vector.asInstanceOf[org.apache.arrow.vector.VarCharVector].get(i)
+        val value = org.apache.spark.unsafe.types.UTF8String.fromBytes(bytes)
+        if (!hasValue) {
+          min = value.clone()
+          max = value.clone()
+          hasValue = true
+        } else {
+          if (value.semanticCompare(min, collationId) < 0) min = value.clone()
+          if (value.semanticCompare(max, collationId) > 0) max = value.clone()
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxDecimal(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int,
+      dataType: org.apache.spark.sql.types.DataType): (Any, Any) = {
+    val decimalType = dataType.asInstanceOf[DecimalType]
+    var min: org.apache.spark.sql.types.Decimal = null
+    var max: org.apache.spark.sql.types.Decimal = null
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val bigDecimal = vector.asInstanceOf[
+          org.apache.arrow.vector.DecimalVector].getObject(i)
+        val value = org.apache.spark.sql.types.Decimal(
+          bigDecimal, decimalType.precision, decimalType.scale)
+
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value.compareTo(min) < 0) min = value
+          if (value.compareTo(max) > 0) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxYearMonthInterval(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Int.MaxValue
+    var max = Int.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = vector.asInstanceOf[org.apache.arrow.vector.IntervalYearVector].get(i)
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxDayTimeInterval(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Long.MaxValue
+    var max = Long.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = org.apache.arrow.vector.DurationVector.get(
+          vector.asInstanceOf[org.apache.arrow.vector.DurationVector].getDataBuffer, i)
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+
+  def calculateMinMaxTime(
+      vector: org.apache.arrow.vector.FieldVector,
+      rowCount: Int): (Any, Any) = {
+    var min = Long.MaxValue
+    var max = Long.MinValue
+    var hasValue = false
+
+    (0 until rowCount).foreach { i =>
+      if (!vector.isNull(i)) {
+        val value = vector.asInstanceOf[org.apache.arrow.vector.TimeNanoVector].get(i)
+        if (!hasValue) {
+          min = value
+          max = value
+          hasValue = true
+        } else {
+          if (value < min) min = value
+          if (value > max) max = value
+        }
+      }
+    }
+
+    if (hasValue) (min, max) else (null, null)
+  }
+}
+
+/**
+ * Iterator that converts InternalRow to ArrowCachedBatch.
+ */
+private class InternalRowToArrowCachedBatchIterator(
+    rowIter: Iterator[InternalRow],
+    schema: Seq[Attribute],
+    sparkSchema: StructType,
+    maxRecordsPerBatch: Long,
+    timeZoneId: String,
+    compressionCodecName: String,
+    compressionLevel: Int) extends Iterator[ArrowCachedBatch] {
+
+  private val compressionCodec = ArrowCachedBatchSerializer.createCompressionCodec(
+    compressionCodecName,
+    compressionLevel)
+
+  private val allocator = ArrowUtils.rootAllocator.newChildAllocator(
+    s"InternalRowToArrowCachedBatchIterator-${TaskContext.get().taskAttemptId()}",
+    0,
+    Long.MaxValue)
+
+  private val arrowSchema = ArrowUtils.toArrowSchema(sparkSchema, timeZoneId, false, false)
+  private val root = VectorSchemaRoot.create(arrowSchema, allocator)
+  private val arrowWriter = ArrowWriter.create(root)
+  private val unloader = new VectorUnloader(root, true, compressionCodec, true)
+
+  // Create statistics collectors for each column
+  private val statsCollectors: Array[ColumnStats] = schema.map { attr =>
+    ArrowCachedBatchSerializer.createColumnStats(attr.dataType)
+  }.toArray
+
+  // Register cleanup
+  Option(TaskContext.get()).foreach { tc =>
+    tc.addTaskCompletionListener[Unit] { _ =>
+      close()
+    }
+  }
+
+  override def hasNext: Boolean = rowIter.hasNext || {
+    close()
+    false
+  }
+
+  override def next(): ArrowCachedBatch = {
+    var rowCount = 0
+
+    // Reset statistics collectors for new batch
+    var idx = 0
+    while (idx < statsCollectors.length) {
+      statsCollectors(idx) = ArrowCachedBatchSerializer.createColumnStats(schema(idx).dataType)
+      idx += 1
+    }
+
+    Utils.tryWithSafeFinally {
+      // Write rows to Arrow vectors and collect statistics incrementally.
+      // A nonpositive maxRecordsPerBatch means unlimited (one batch per partition), matching
+      // ArrowConverters; without the `<= 0` guard the loop would emit empty batches forever.
+      while (rowIter.hasNext && (maxRecordsPerBatch <= 0 || rowCount < maxRecordsPerBatch)) {
+        val row = rowIter.next()
+        arrowWriter.write(row)
+
+        // Collect statistics for this row
+        var i = 0
+        while (i < statsCollectors.length) {
+          statsCollectors(i).gatherStats(row, i)
+          i += 1
+        }
+
+        rowCount += 1
+      }
+      arrowWriter.finish()
+
+      // Get the Arrow RecordBatch with compression
+      val recordBatch = unloader.getRecordBatch()
+
+      Utils.tryWithSafeFinally {
+        // Serialize to Arrow IPC format
+        val arrowData = ArrowCachedBatchSerializer.serializeBatch(recordBatch)
+
+        // Build statistics InternalRow from collected stats
+        val stats = ArrowCachedBatchSerializer.buildStatisticsFromCollectors(
+          statsCollectors, schema)
+
+        ArrowCachedBatch(rowCount, arrowData, stats)
+      } {
+        recordBatch.close()
+      }
+    } {
+      arrowWriter.reset()
+    }
+  }
+
+  private def close(): Unit = {
+    root.close()
+    allocator.close()
+  }
+}
+
+/**
+ * Iterator that converts ColumnarBatch to ArrowCachedBatch.
+ */
+private class ColumnarBatchToArrowCachedBatchIterator(
+    batchIter: Iterator[ColumnarBatch],
+    schema: Seq[Attribute],
+    sparkSchema: StructType,
+    timeZoneId: String,
+    compressionCodecName: String,
+    compressionLevel: Int) extends Iterator[ArrowCachedBatch] {
+
+  private val compressionCodec = ArrowCachedBatchSerializer.createCompressionCodec(
+    compressionCodecName,
+    compressionLevel)
+
+  private val allocator = ArrowUtils.rootAllocator.newChildAllocator(
+    s"ColumnarBatchToArrowCachedBatchIterator-${TaskContext.get().taskAttemptId()}",
+    0,
+    Long.MaxValue)
+
+  private val arrowSchema = ArrowUtils.toArrowSchema(sparkSchema, timeZoneId, false, false)
+
+  // Register cleanup
+  Option(TaskContext.get()).foreach { tc =>
+    tc.addTaskCompletionListener[Unit] { _ =>
+      allocator.close()
+    }
+  }
+
+  override def hasNext: Boolean = batchIter.hasNext
+
+  override def next(): ArrowCachedBatch = {
+    val batch = batchIter.next()
+    val rowCount = batch.numRows()
+
+    // Check if batch is already Arrow-based for zero-copy path. The zero-copy path reuses the
+    // input vectors but serializes them under a schema built with largeVarTypes=false, and the
+    // read path reconstructs that same non-large schema. Large var-width vectors use 64-bit
+    // offsets, so reading them back under a 32-bit-offset schema would silently corrupt data.
+    // Fall back to the row-based conversion (which always produces standard var-width vectors)
+    // whenever any input vector is, or nests, a large var-width vector.
+    val vectors = (0 until batch.numCols()).map(batch.column)
+    val zeroCopyEligible = vectors.forall {
+      case acv: ArrowColumnVector =>
+        !ColumnarBatchToArrowCachedBatchIterator.containsLargeVarType(acv.getValueVector)
+      case _ => false
+    }
+    if (zeroCopyEligible) {
+      // Fast path: zero-copy extraction of Arrow RecordBatch
+      convertArrowBatchZeroCopy(batch, rowCount, schema, vectors)
+    } else {
+      // Slow path: convert to Arrow via rows
+      convertToArrowBatch(batch, rowCount, schema)
+    }
+  }
+
+  private def convertArrowBatchZeroCopy(
+      batch: ColumnarBatch,
+      rowCount: Int,
+      schema: Seq[Attribute],
+      vectors: Seq[ColumnVector]): ArrowCachedBatch = {
+    // Zero-copy path: extract Arrow vectors directly from ArrowColumnVector
+    val arrowVectors = vectors.map(
+      _.asInstanceOf[ArrowColumnVector].getValueVector.asInstanceOf[
+        org.apache.arrow.vector.FieldVector])
+
+    // Create a VectorSchemaRoot from the existing vectors
+    val root = new VectorSchemaRoot(arrowSchema, arrowVectors.asJava, rowCount)
+
+    Utils.tryWithSafeFinally {
+      // Use VectorUnloader to create compressed RecordBatch
+      val unloader = new VectorUnloader(root, true, compressionCodec, true)
+      val recordBatch = unloader.getRecordBatch()
+
+      Utils.tryWithSafeFinally {
+        val arrowData = ArrowCachedBatchSerializer.serializeBatch(recordBatch)
+        val stats = ArrowCachedBatchSerializer.collectStatistics(root, schema)
+        ArrowCachedBatch(rowCount, arrowData, stats)
+      } {
+        recordBatch.close()
+      }
+    } {
+      // Note: We don't close the root here because we don't own the vectors
+      // They are owned by the input ColumnarBatch
+    }
+  }
+
+  private def convertToArrowBatch(
+      batch: ColumnarBatch,
+      rowCount: Int,
+      schema: Seq[Attribute]): ArrowCachedBatch = {
+    // Convert columnar batch to rows, then to Arrow
+    val root = VectorSchemaRoot.create(arrowSchema, allocator)
+    val arrowWriter = ArrowWriter.create(root)
+    val unloader = new VectorUnloader(root, true, compressionCodec, true)
+
+    // Collect statistics inline during row iteration, same as InternalRowToArrow path
+    val statsCollectors: Array[ColumnStats] = schema.map { attr =>
+      ArrowCachedBatchSerializer.createColumnStats(attr.dataType)
+    }.toArray
+
+    Utils.tryWithSafeFinally {
+      val rowIterator = batch.rowIterator().asScala
+      while (rowIterator.hasNext) {
+        val row = rowIterator.next()
+        arrowWriter.write(row)
+
+        // Collect statistics for this row inline
+        var i = 0
+        while (i < statsCollectors.length) {
+          statsCollectors(i).gatherStats(row, i)
+          i += 1
+        }
+      }
+      arrowWriter.finish()
+
+      val recordBatch = unloader.getRecordBatch()
+      Utils.tryWithSafeFinally {
+        val arrowData = ArrowCachedBatchSerializer.serializeBatch(recordBatch)
+        val stats = ArrowCachedBatchSerializer.buildStatisticsFromCollectors(
+          statsCollectors, schema)
+        ArrowCachedBatch(rowCount, arrowData, stats)
+      } {
+        recordBatch.close()
+      }
+    } {
+      arrowWriter.reset()
+      root.close()
+    }
+  }
+}
+
+private object ColumnarBatchToArrowCachedBatchIterator {
+  import org.apache.arrow.vector.{FieldVector, LargeVarBinaryVector, LargeVarCharVector}
+
+  /**
+   * Whether the vector is, or nests, a large var-width vector (64-bit offsets). These are not
+   * eligible for the zero-copy path because that path serializes and reloads under a schema built
+   * with largeVarTypes=false; reinterpreting 64-bit offset buffers as 32-bit would corrupt data.
+   */
+  def containsLargeVarType(vector: org.apache.arrow.vector.ValueVector): Boolean = vector match {
+    case _: LargeVarCharVector | _: LargeVarBinaryVector => true
+    case fv: FieldVector =>
+      fv.getChildrenFromFields.asScala.exists(containsLargeVarType)
+    case _ => false
+  }
+}
+
+/**
+ * Iterator that converts ArrowCachedBatch to ColumnarBatch.
+ */
+private class ArrowCachedBatchToColumnarBatchIterator(
+    batchIter: Iterator[CachedBatch],
+    cacheSchema: StructType,
+    selectedSchema: StructType,
+    columnIndices: Array[Int],
+    timeZoneId: String,
+    prefetchEnabled: Boolean = false) extends Iterator[ColumnarBatch] {
+
+  import java.util.concurrent.{Callable, ExecutionException, Executors, ExecutorService, Future}
+
+  private val allocator = ArrowUtils.rootAllocator.newChildAllocator(
+    s"ArrowCachedBatchToColumnarBatchIterator-${TaskContext.get().taskAttemptId()}",
+    0,
+    Long.MaxValue)
+
+  private val arrowSchema = ArrowUtils.toArrowSchema(cacheSchema, timeZoneId, false, false)
+
+  // Track only the previous root to close it when next batch is produced
+  private var previousRoot: VectorSchemaRoot = null
+
+  // Prefetch support: deserialize the next batch into its own root in a background thread while
+  // the current batch is being consumed. Only the deserialization (IPC read + decompression +
+  // loading into a fresh root) happens off-thread; closing the previous root stays on the
+  // consumer thread in next(), so the vectors backing a returned ColumnarBatch are never released
+  // while the consumer may still read them.
+  private val prefetchExecutor: ExecutorService = if (prefetchEnabled) {
+    Executors.newSingleThreadExecutor(r => {
+      val t = new Thread(r, "arrow-cache-prefetch")
+      t.setDaemon(true)
+      t
+    })
+  } else {
+    null
+  }
+  private var prefetchFuture: Future[VectorSchemaRoot] = _
+
+  // Register cleanup - close remaining root and allocator when task completes
+  Option(TaskContext.get()).foreach { tc =>
+    tc.addTaskCompletionListener[Unit] { _ =>
+      if (prefetchFuture != null) {
+        prefetchFuture.cancel(true)

Review Comment:
   Fixed. `drainAndClosePrefetch` now clears the thread's interrupt status for 
the duration of the drain, loops on `awaitTermination` until the worker has 
actually terminated (re-clearing the status if it is interrupted again), then 
retrieves and closes any produced root, and restores the interrupt only at the 
very end. As a result, a completion listener running on an already-interrupted 
(killed) task still joins the worker and closes the root before the allocator 
is closed. Added a test that calls it from an interrupted thread with a 
prefetch in flight and asserts that the produced root is closed (no leak on 
`allocator.close()`) and that the interrupt is restored. The fix applies to 
both the row and columnar readers via the shared helper.
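
The drain sequence described in this comment can be sketched roughly as follows. This is not the PR's `drainAndClosePrefetch`; it is a minimal, self-contained illustration in which `PrefetchDrainSketch`, `drainAndClose`, and the `AutoCloseable` stand-in for a `VectorSchemaRoot` are all hypothetical names, and it assumes the worker runs on a dedicated single-thread executor.

```scala
import java.util.concurrent.{Callable, ExecutionException, ExecutorService, Executors, Future, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object PrefetchDrainSketch {

  /**
   * Hypothetical stand-in for the helper described above: wait for the worker to exit even if
   * the calling thread is interrupted, close whatever it produced, then restore the interrupt.
   */
  def drainAndClose[T <: AutoCloseable](executor: ExecutorService, future: Future[T]): Unit = {
    // Thread.interrupted() reads and clears the flag, so the blocking wait below can proceed.
    var interrupted = Thread.interrupted()
    try {
      executor.shutdown() // accept no new work; an in-flight task runs to completion
      var terminated = false
      while (!terminated) {
        try {
          terminated = executor.awaitTermination(100, TimeUnit.MILLISECONDS)
        } catch {
          // Throwing InterruptedException already cleared the flag; remember it and keep waiting.
          case _: InterruptedException => interrupted = true
        }
      }
      // The worker has exited, so get() on a completed future returns without blocking.
      if (future != null && future.isDone && !future.isCancelled) {
        try Option(future.get()).foreach(_.close())
        catch { case _: ExecutionException => () } // the task failed, so there is nothing to close
      }
    } finally {
      if (interrupted) Thread.currentThread().interrupt() // restore only at the very end
    }
  }

  def main(args: Array[String]): Unit = {
    val closed = new AtomicBoolean(false)
    val executor = Executors.newSingleThreadExecutor()
    val future = executor.submit(new Callable[AutoCloseable] {
      override def call(): AutoCloseable = {
        Thread.sleep(200) // simulate deserialization still in flight
        new AutoCloseable { override def close(): Unit = closed.set(true) }
      }
    })
    Thread.currentThread().interrupt() // simulate a listener running on a killed task
    drainAndClose(executor, future)
    println(s"closed=${closed.get()} interruptRestored=${Thread.interrupted()}")
  }
}
```

The sketch uses `shutdown()` rather than cancelling the future, so the in-flight task always finishes and its result can be closed; a real implementation may prefer to interrupt the worker and handle a partially loaded root instead.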



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ArrowCachedBatchSerializer.scala:
##########
@@ -0,0 +1,1371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.columnar
+
+import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
+import java.nio.channels.Channels
+
+import scala.jdk.CollectionConverters._
+
+import org.apache.arrow.compression.{Lz4CompressionCodec, ZstdCompressionCodec}
+import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot, VectorUnloader}
+import org.apache.arrow.vector.compression.{CompressionCodec, NoCompressionCodec}
+import org.apache.arrow.vector.ipc.{ReadChannel, WriteChannel}
+import org.apache.arrow.vector.ipc.message.{ArrowRecordBatch, MessageSerializer}
+
+import org.apache.spark.{SparkException, TaskContext}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
+import org.apache.spark.sql.catalyst.types.DataTypeUtils
+import org.apache.spark.sql.columnar.{CachedBatch, SimpleMetricsCachedBatchSerializer}
+import org.apache.spark.sql.execution.arrow.ArrowWriter
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.util.ArrowUtils
+import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.unsafe.types.UTF8String
+import org.apache.spark.util.Utils
+
+/**
+ * A [[CachedBatchSerializer]] that uses Apache Arrow as the cache format.
+ *
+ * This serializer:
+ *  - Supports both row-based (InternalRow) and columnar (ColumnarBatch) input
+ *  - Stores data in Arrow IPC streaming format with optional compression (zstd/lz4)
+ *  - Enables zero-copy columnar reads when output is ColumnarBatch
+ *  - Uses off-heap memory via Arrow allocators
+ *  - Collects per-column statistics for partition pruning
+ *  - Provides efficient interoperability with Arrow ecosystem
+ *
+ * Configuration options:
+ *  - spark.sql.cache.serializer: Set to this class name to enable
+ *  - spark.sql.execution.arrow.maxRecordsPerBatch: Max rows per cached batch
+ *  - spark.sql.execution.arrow.compression.codec: Compression (none/zstd/lz4)
+ *  - spark.sql.inMemoryColumnarStorage.enableVectorizedReader: Enable columnar output
+ */
+class ArrowCachedBatchSerializer extends SimpleMetricsCachedBatchSerializer {
+
+  override def supportsColumnarInput(schema: Seq[Attribute]): Boolean = {
+    // Check if all data types in the schema are supported by Arrow
+    schema.forall(attr => ArrowUtils.isSupportedByArrow(attr.dataType))
+  }
+
+  override def convertInternalRowToCachedBatch(
+      input: RDD[InternalRow],
+      schema: Seq[Attribute],
+      storageLevel: StorageLevel,
+      conf: SQLConf): RDD[CachedBatch] = {
+    // Capture config values on driver before RDD transformation
+    val sparkSchema = DataTypeUtils.fromAttributes(schema)
+    val maxRecordsPerBatch = conf.arrowMaxRecordsPerBatch

Review Comment:
   Split into the two halves. The byte-estimate half is fixed: the row path now 
measures `arrowWriter.sizeInBytes()` (the actual bytes written to the Arrow 
vectors) instead of `numFields * 16`, so a large value in a 
`GenericInternalRow` is accounted for correctly. Added a test that a tiny 
`maxBytesPerBatch` forces multiple batches.

   On splitting columnar input: I'd prefer not to add columnar slicing for the 
recache scenario. The upstream `ColumnarBatch` row count is already bounded by 
the producing source's batch-size config, and batch size is a performance knob 
rather than a correctness one -- a recached relation that keeps its original 
batch sizing produces identical results. The repro requires lowering 
`maxRecordsPerBatch` specifically between caching A and recaching a projection 
of A, which doesn't reduce memory pressure (the upstream batches are already 
materialized) and I couldn't find a realistic workload for it. Happy to revisit 
if you have one in mind.
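
To illustrate why a measured size matters for the row path, here is a minimal sketch of byte-bounded batching. It is not the PR's code: `ByteBoundedBatching`, `batches`, and `sizeOf` are hypothetical names, `sizeOf` stands in for asking the writer how many bytes it has buffered, and the choice to check the limit after each row is written is an assumption.

```scala
object ByteBoundedBatching {

  /**
   * Groups `rows` into batches. A batch is closed once it holds `maxRecords` rows or once the
   * measured bytes reach `maxBytes`; a nonpositive limit means "unlimited". `sizeOf` is a
   * hypothetical stand-in for asking the writer how many bytes a row actually added.
   */
  def batches[T](rows: Iterator[T], maxRecords: Long, maxBytes: Long)(
      sizeOf: T => Long): Iterator[Vector[T]] = new Iterator[Vector[T]] {

    override def hasNext: Boolean = rows.hasNext

    override def next(): Vector[T] = {
      if (!rows.hasNext) throw new NoSuchElementException("no more batches")
      val batch = Vector.newBuilder[T]
      var count = 0L
      var bytes = 0L
      // Always accept the first row so a single oversized row cannot stall the iterator.
      while (rows.hasNext && (count == 0 ||
          ((maxRecords <= 0 || count < maxRecords) && (maxBytes <= 0 || bytes < maxBytes)))) {
        val row = rows.next()
        batch += row
        count += 1
        bytes += sizeOf(row) // measured size, not a fixed per-field guess
      }
      batch.result()
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq("a" * 10, "b" * 10, "c" * 10)
    val sizes = batches(rows.iterator, maxRecords = 0, maxBytes = 15)(_.length.toLong)
      .map(_.size).toList
    println(sizes) // batch sizes under a 15-byte limit with 10-byte rows
  }
}
```

With a fixed per-field estimate, a row carrying one very large value would look as small as any other row and the byte limit would never trigger; measuring what was actually written lets a small limit split the input into several batches.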



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/ArrowCacheBenchmark.scala:
##########
@@ -0,0 +1,805 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import org.apache.spark.benchmark.Benchmark
+import org.apache.spark.internal.config.UI.UI_ENABLED
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer
+import org.apache.spark.sql.internal.{SQLConf, StaticSQLConf}
+
+/**
+ * Benchmark to measure cache performance with Arrow format vs Default format.
+ *
+ * To run this benchmark:
+ * {{{
+ *   1. without sbt: bin/spark-submit --class <this class>
+ *     --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>
+ *   2. build/sbt "sql/Test/runMain <this class>"
+ *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain <this class>"
+ *      Results will be written to "benchmarks/ArrowCacheBenchmark-results.txt".
+ * }}}
+ */
+object ArrowCacheBenchmark extends SqlBasedBenchmark {
+
+  // Do NOT access the inherited `spark` session - it uses default serializer
+  // Instead, create fresh sessions for each benchmark
+
+  // Create separate sessions for each cache format since SPARK_CACHE_SERIALIZER is static
+  // CRITICAL: Can only have one active SparkContext at a time
+  private def createFreshSession(serializer: String): SparkSession = {
+    // Stop any existing session and clear the registry
+    SparkSession.getActiveSession.foreach(_.stop())
+    SparkSession.clearActiveSession()
+    SparkSession.clearDefaultSession()
+
+    // CRITICAL: Clear the cached serializer instance in InMemoryRelation
+    // This singleton is stored statically and persists across sessions
+    org.apache.spark.sql.execution.columnar.InMemoryRelation.clearSerializer()
+
+    SparkSession.builder()
+      .master("local[1]")
+      .appName(s"ArrowCacheBenchmark-$serializer")
+      .config(SQLConf.SHUFFLE_PARTITIONS.key, 1)
+      .config(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key, 1)
+      .config(UI_ENABLED.key, false)
+      .config(StaticSQLConf.SPARK_CACHE_SERIALIZER.key, serializer)
+      .getOrCreate()
+  }
+
+  private def cachePrimitiveTypes(): Unit = {
+    val numRows = 5000000 // 5M rows for faster benchmarking
+    runBenchmark("Cache primitive types") {
+      val benchmark = new Benchmark("Cache 5M rows with primitives", numRows, output = output)
+
+      // Run Default cache benchmark (with compression - default)
+      benchmark.addCase("Default cache - write + read") { _ =>
+        val spark = createFreshSession(
+          "org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer")
+        try {
+          val df = spark.range(numRows).selectExpr(
+            "id as int_col",
+            "id * 2L as long_col",
+            "cast(id as double) as double_col"
+          )
+          df.cache()
+          df.write.format("noop").mode("overwrite").save()
+          df.unpersist(blocking = true)
+        } finally {
+          spark.stop()
+        }
+      }
+
+      // Run Default cache without compression
+      benchmark.addCase("Default cache - write + read (uncompressed)") { _ =>
+        val spark = createFreshSession(
+          "org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer")
+        try {
+          spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")
+          val df = spark.range(numRows).selectExpr(
+            "id as int_col",
+            "id * 2L as long_col",
+            "cast(id as double) as double_col"
+          )
+          df.cache()
+          df.write.format("noop").mode("overwrite").save()
+          df.unpersist(blocking = true)
+        } finally {
+          spark.stop()
+        }
+      }
+
+      // Run Arrow cache benchmark
+      benchmark.addCase("Arrow cache - write + read") { _ =>
+        val spark = createFreshSession(classOf[ArrowCachedBatchSerializer].getName)
+        try {
+          val df = spark.range(numRows).selectExpr(
+            "id as int_col",
+            "id * 2L as long_col",
+            "cast(id as double) as double_col"
+          )
+          df.cache()
+          df.write.format("noop").mode("overwrite").save()
+          df.unpersist(blocking = true)
+        } finally {
+          spark.stop()
+        }
+      }
+
+      // NOTE: LZ4 compression benchmarks are commented out because Arrow's LZ4 implementation
+      // requires the optional lz4-java native library dependency. Without it, Arrow falls back
+      // to Apache Commons Compress pure-Java LZ4 implementation which is extremely slow
+      // (~50x slower than zstd). To enable fast LZ4 benchmarks, add this dependency to pom.xml:
+      //   <dependency>
+      //     <groupId>org.lz4</groupId>
+      //     <artifactId>lz4-java</artifactId>
+      //     <version>1.8.0</version>
+      //   </dependency>
+
+      // // Run Arrow cache with lz4 compression benchmark
+      // benchmark.addCase("Arrow cache - write + read (lz4)") { _ =>
+      //   val spark = createFreshSession(classOf[ArrowCachedBatchSerializer].getName)
+      //   try {
+      //     spark.conf.set(SQLConf.ARROW_EXECUTION_COMPRESSION_CODEC.key, "lz4")
+      //     val df = spark.range(numRows).selectExpr(
+      //       "id as int_col",
+      //       "id * 2L as long_col",
+      //       "cast(id as double) as double_col"
+      //     )
+      //     df.cache()
+      //     df.write.format("noop").mode("overwrite").save()
+      //     df.unpersist(blocking = true)
+      //   } finally {
+      //     spark.stop()
+      //   }
+      // }
+
+      // Run Arrow cache with zstd level -1 (fastest) compression benchmark
+      benchmark.addCase("Arrow cache - write + read (zstd level -1)") { _ =>
+        val spark = createFreshSession(classOf[ArrowCachedBatchSerializer].getName)
+        try {
+          spark.conf.set(SQLConf.ARROW_EXECUTION_COMPRESSION_CODEC.key, "zstd")
+          spark.conf.set(SQLConf.ARROW_EXECUTION_ZSTD_COMPRESSION_LEVEL.key, "-1")
+          val df = spark.range(numRows).selectExpr(
+            "id as int_col",
+            "id * 2L as long_col",
+            "cast(id as double) as double_col"
+          )
+          df.cache()
+          df.write.format("noop").mode("overwrite").save()
+          df.unpersist(blocking = true)
+        } finally {
+          spark.stop()
+        }
+      }
+
+      // Run Arrow cache with zstd level 1 compression benchmark
+      benchmark.addCase("Arrow cache - write + read (zstd level 1)") { _ =>
+        val spark = createFreshSession(classOf[ArrowCachedBatchSerializer].getName)
+        try {
+          spark.conf.set(SQLConf.ARROW_EXECUTION_COMPRESSION_CODEC.key, "zstd")
+          spark.conf.set(SQLConf.ARROW_EXECUTION_ZSTD_COMPRESSION_LEVEL.key, "1")
+          val df = spark.range(numRows).selectExpr(
+            "id as int_col",
+            "id * 2L as long_col",
+            "cast(id as double) as double_col"
+          )
+          df.cache()
+          df.write.format("noop").mode("overwrite").save()
+          df.unpersist(blocking = true)
+        } finally {
+          spark.stop()
+        }
+      }
+
+      // Run Arrow cache with zstd level 3 (default) compression benchmark
+      benchmark.addCase("Arrow cache - write + read (zstd level 3)") { _ =>
+        val spark = createFreshSession(classOf[ArrowCachedBatchSerializer].getName)
+        try {
+          spark.conf.set(SQLConf.ARROW_EXECUTION_COMPRESSION_CODEC.key, "zstd")
+          val df = spark.range(numRows).selectExpr(
+            "id as int_col",
+            "id * 2L as long_col",
+            "cast(id as double) as double_col"
+          )
+          df.cache()
+          df.write.format("noop").mode("overwrite").save()
+          df.unpersist(blocking = true)
+        } finally {
+          spark.stop()
+        }
+      }
+
+      benchmark.run()
+    }
+  }
+
+  private def cacheWithFilters(): Unit = {
+    val numRows = 5000000 // 5M rows
+    runBenchmark("Cache with filter pushdown") {
+      val benchmark = new Benchmark("Cache 5M rows + filter", numRows, output = output)
+
+      // Default cache filter benchmark (with compression - default)
+      benchmark.addCase("Default cache - filter") { _ =>
+        val spark = createFreshSession(
+          "org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer")
+        try {
+          val df = spark.range(numRows).selectExpr(
+            "id as int_col",
+            "cast(id as double) as double_col"
+          )
+          df.cache()
+          df.write.format("noop").mode("overwrite").save() // Materialize cache by reading all rows
+          df.filter("int_col > 2500000").count()
+          df.unpersist(blocking = true)
+        } finally {
+          spark.stop()
+        }
+      }
+
+      // Default cache filter without compression
+      benchmark.addCase("Default cache - filter (uncompressed)") { _ =>
+        val spark = createFreshSession(
+          "org.apache.spark.sql.execution.columnar.DefaultCachedBatchSerializer")
+        try {
+          spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "false")
+          val df = spark.range(numRows).selectExpr(
+            "id as int_col",
+            "cast(id as double) as double_col"
+          )
+          df.cache()
+          df.write.format("noop").mode("overwrite").save() // Materialize cache by reading all rows
+          df.filter("int_col > 2500000").count()
+          df.unpersist(blocking = true)
+        } finally {
+          spark.stop()
+        }
+      }
+
+      // Arrow cache filter benchmark
+      benchmark.addCase("Arrow cache - filter (with stats)") { _ =>
+        val spark = createFreshSession(classOf[ArrowCachedBatchSerializer].getName)
+        try {
+          val df = spark.range(numRows).selectExpr(
+            "id as int_col",
+            "cast(id as double) as double_col"
+          )
+          df.cache()
+          df.write.format("noop").mode("overwrite").save() // Materialize cache by reading all rows
+          df.filter("int_col > 2500000").count()

Review Comment:
   Fixed. I've triggered the benchmark workflow (JDK 17/21/25) to regenerate 
all three result files from the current head, so they will carry the renamed 
"Cache then filter" labels and drop the stale pruning-attribution naming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

