Re: [PR] [SPARK-56999][PYSPARK] Fix mapInArrow opaque getInt error by coercing output to declared schema [spark]

via GitHub Mon, 01 Jun 2026 17:29:10 -0700


gaogaotiantian commented on code in PR #56049:
URL: https://github.com/apache/spark/pull/56049#discussion_r3337877522



##########
python/pyspark/sql/tests/arrow/test_arrow_map.py:
##########
@@ -79,6 +79,26 @@ def func(iterator):
         expected = df.collect()
         self.assertEqual(actual, expected)
 
+    def test_coerce_output_type_to_declared_schema(self):
+        # Regression test: when the user yields a batch whose Arrow type does
+        # not match the declared output schema, the worker should coerce it
+        # rather than letting the JVM fail later with an opaque getInt error
+        # on the wrong ArrowColumnVector accessor.
+        from pyspark.sql.types import IntegerType, StructField, StructType
+
+        def double_x(iter_batches):
+            for batch in iter_batches:
+                # The input column is long (int64); produce int64 output even

Review Comment:
   Does the input column type matter? I see three types here: `type=pa.int64()`, 
the long type inferred by `createDataFrame`, and `IntegerType()`. Do all three matter?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


