viirya commented on code in PR #56334:
URL: https://github.com/apache/spark/pull/56334#discussion_r3487468545
##########
sql/api/src/main/scala/org/apache/spark/sql/util/ArrowUtils.scala:
##########
@@ -38,6 +38,50 @@ private[sql] object ArrowUtils {
// todo: support more types.
+ /**
+ * Check if a Spark DataType is supported by Arrow. This recursively checks complex types
+ * (Array, Struct, Map).
+ *
+ * Note: This checks compatibility with toArrowField(), not toArrowType(). Types like
+ * GeometryType, GeographyType, and VariantType are not supported by toArrowType() (which only
+ * handles primitive Arrow types), but ARE supported by toArrowField() which converts them to
+ * Arrow Struct representations with metadata. Since Arrow cache uses toArrowField() via
+ * toArrowSchema() to create the schema, these types are supported.
+ */
+ def isSupportedByArrow(dt: DataType): Boolean = {
+ dt match {
+ // Primitive types
+ case BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType | DoubleType |
+ _: StringType | BinaryType | NullType =>
+ true
+
+ // Decimal
+ case _: DecimalType => true
+
+ // Temporal types
+ case DateType | TimestampType | TimestampNTZType | _: TimeType => true
Review Comment:
Follow-up filed: SPARK-57735 / https://github.com/apache/spark/pull/56842
adds nanosecond-timestamp support to the default in-memory cache
(`DefaultCachedBatchSerializer`), which is the prerequisite. Once that lands,
the Arrow cache can route these types through the same statistics machinery
rather than rejecting them here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]