acruise commented on code in PR #56334:
URL: https://github.com/apache/spark/pull/56334#discussion_r3382377240
##########
docs/sql-arrow-cache-format.md:
##########
@@ -0,0 +1,312 @@
+# Apache Arrow Cache Format for Spark
+
+## Overview
+
+Apache Spark supports using Apache Arrow as an alternative cache format for
in-memory Dataset caching. This format provides improved performance for
certain workloads, especially when working with columnar data sources like
Parquet and ORC.
+
+## Benefits
+
+The Arrow cache format offers several advantages over the default cache format:
+
+- **Zero-copy reads** when input is already in Arrow format (e.g., Arrow-based
data sources, re-caching Arrow cached data)
+- **Better filter pushdown** with min/max statistics for partition pruning
+- **Off-heap memory management** via Arrow allocators
+- **Efficient compression** with zstd and lz4 codecs
+- **Arrow ecosystem interoperability** for data sharing
+
+**Note**: Spark's built-in Parquet/ORC readers use internal column vectors
(`OnHeapColumnVector`/`OffHeapColumnVector`), not Arrow format, so they don't
benefit from zero-copy optimization.
+
+## Configuration
+
+To enable Arrow cache format, set the static configuration:
+
+```scala
+spark.conf.set("spark.sql.cache.serializer",
+ "org.apache.spark.sql.execution.columnar.ArrowCachedBatchSerializer")
Review Comment:
This denotes an exclusive setting, do we expect any scenarios where this
cache serializer wouldn't work and a fallback might be necessary?
##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
// todo: support more types.
+ /**
+ * Check if a Spark DataType is supported by Arrow. This recursively checks
complex types
+ * (Array, Struct, Map).
+ *
+ * Note: This checks compatibility with toArrowField(), not toArrowType().
Types like
+ * GeometryType, GeographyType, and VariantType are not supported by
toArrowType() (which only
+ * handles primitive Arrow types), but ARE supported by toArrowField() which
converts them to
+ * Arrow Struct representations with metadata. Since Arrow cache uses
toArrowField() via
+ * toArrowSchema() to create the schema, these types are supported.
+ */
+ def isSupportedByArrow(dt: DataType): Boolean = {
Review Comment:
Presumably when this returns `false` for any reason, we fallback to the
default cache driver, that should be made clear in the docs if it isn't already
##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
// todo: support more types.
+ /**
+ * Check if a Spark DataType is supported by Arrow. This recursively checks
complex types
+ * (Array, Struct, Map).
+ *
+ * Note: This checks compatibility with toArrowField(), not toArrowType().
Types like
+ * GeometryType, GeographyType, and VariantType are not supported by
toArrowType() (which only
+ * handles primitive Arrow types), but ARE supported by toArrowField() which
converts them to
+ * Arrow Struct representations with metadata. Since Arrow cache uses
toArrowField() via
+ * toArrowSchema() to create the schema, these types are supported.
+ */
+ def isSupportedByArrow(dt: DataType): Boolean = {
Review Comment:
e.g. the doc
[says](https://github.com/apache/spark/pull/56334/changes#diff-bce75e361105d2c33d929d323f5128316b3a446496e7999f726dae5bdb291167R118)
`supports all Spark SQL data types` but this implementation would seem to
falsify that claim ;)
##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
// todo: support more types.
+ /**
+ * Check if a Spark DataType is supported by Arrow. This recursively checks
complex types
+ * (Array, Struct, Map).
+ *
+ * Note: This checks compatibility with toArrowField(), not toArrowType().
Types like
+ * GeometryType, GeographyType, and VariantType are not supported by
toArrowType() (which only
+ * handles primitive Arrow types), but ARE supported by toArrowField() which
converts them to
+ * Arrow Struct representations with metadata. Since Arrow cache uses
toArrowField() via
+ * toArrowSchema() to create the schema, these types are supported.
+ */
+ def isSupportedByArrow(dt: DataType): Boolean = {
Review Comment:
Well, I guess the claim may be true, and the fallthrough at the end might be
defensive... In which case maybe we'd want to log a surprisingly unsupported
type :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]