viirya commented on code in PR #56334:
URL: https://github.com/apache/spark/pull/56334#discussion_r3470957437
##########
docs/sql-arrow-cache-format.md:
##########
@@ -0,0 +1,312 @@
+# Apache Arrow Cache Format for Spark
+
+## Overview
+
+Apache Spark supports using Apache Arrow as an alternative cache format for
in-memory Dataset caching. This format provides improved performance for
certain workloads, especially when working with columnar data sources like
Parquet and ORC.
+
+## Benefits
+
+The Arrow cache format offers several advantages over the default cache format:
+
+- **Zero-copy reads** when input is already in Arrow format (e.g., Arrow-based
data sources, re-caching Arrow cached data)
+- **Better filter pushdown** with min/max statistics for partition pruning
+- **Off-heap memory management** via Arrow allocators
+- **Efficient compression** with zstd and lz4 codecs
+- **Arrow ecosystem interoperability** for data sharing
+
+**Note**: Spark's built-in Parquet/ORC readers use internal column vectors
(`OnHeapColumnVector`/`OffHeapColumnVector`), not Arrow format, so they don't
benefit from zero-copy optimization.
+
+## Configuration
+
+To enable Arrow cache format, set the static configuration:
+
+```scala
+spark.conf.set("spark.sql.cache.serializer",
+ "org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer")
Review Comment:
The cache serializer is selected once by the static
`spark.sql.cache.serializer` conf, and once set it handles every cached
relation -- Spark's cache framework has no per-relation fallback to a different
serializer based on data type. So there's no automatic switch back to the
default cache driver. The Arrow serializer aims to cover everything the default
serializer does (and more -- e.g. geometry/geography, which the default has no
encoding for). If a column uses a type Arrow genuinely cannot represent (e.g.
`ObjectType`), materializing the cache throws `UNSUPPORTED_DATATYPE` rather
than silently falling back. I've clarified both points in the docs.
##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
// todo: support more types.
+ /**
+ * Check if a Spark DataType is supported by Arrow. This recursively checks
complex types
+ * (Array, Struct, Map).
+ *
+ * Note: This checks compatibility with toArrowField(), not toArrowType().
Types like
+ * GeometryType, GeographyType, and VariantType are not supported by
toArrowType() (which only
+ * handles primitive Arrow types), but ARE supported by toArrowField() which
converts them to
+ * Arrow Struct representations with metadata. Since Arrow cache uses
toArrowField() via
+ * toArrowSchema() to create the schema, these types are supported.
+ */
+ def isSupportedByArrow(dt: DataType): Boolean = {
Review Comment:
Good catch -- "supports all Spark SQL data types" did overstate it, and the
type list was also incomplete (it omitted Time, intervals, Geometry/Geography,
Variant, Null, and UDTs, which are all supported). Fixed the doc to list the
actually-supported set and to describe the unsupported-type behavior.
One clarification on the mechanism, re: your fallback question:
`isSupportedByArrow` here only gates `supportsColumnarInput`, which the cache
framework uses to choose the columnar-vs-row *input* path into this same
serializer -- it isn't a fallback to the default cache driver (there's no such
per-type fallback). A truly unsupported type isn't silently dropped either:
`toArrowSchema` throws `UNSUPPORTED_DATATYPE` when the cache is materialized.
The docs now state this explicitly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]