[jira] [Updated] (SPARK-53029) Support return type coercion for Arrow Python UDTFs

Allison Wang (Jira) Fri, 22 Aug 2025 11:58:37 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-53029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Allison Wang updated SPARK-53029:
---------------------------------
    Description: 
Address this follow up discussion

[https://github.com/apache/spark/pull/51692#discussion_r2241413025]

Currently

 @arrow_udtf(returnType="x int")
    ...: class MyArrowUDTF:
    ...:     def eval(self, batch: pa.RecordBatch):
    ...:         yield pa.table(\{"x": batch.column(0)})

Will throw this error

In [12]: MyArrowUDTF(df.asTable()).show()
25/08/22 11:55:23 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 22)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "pyarrow/table.pxi", line 4865, in pyarrow.lib.Table.from_batches
  File "pyarrow/error.pxi", line 155, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 0 was different:
x: int32
vs
x: int64
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:622)
at 
org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:130)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:573)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
 
And this will block forever.

  was:
Address this follow up discussion

[https://github.com/apache/spark/pull/51692#discussion_r2241413025]


> Support return type coercion for Arrow Python UDTFs
> ---------------------------------------------------
>
>                 Key: SPARK-53029
>                 URL: https://issues.apache.org/jira/browse/SPARK-53029
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.1.0
>            Reporter: Allison Wang
>            Priority: Major
>
> Address this follow up discussion
> [https://github.com/apache/spark/pull/51692#discussion_r2241413025]
> Currently
>  @arrow_udtf(returnType="x int")
>     ...: class MyArrowUDTF:
>     ...:     def eval(self, batch: pa.RecordBatch):
>     ...:         yield pa.table(\{"x": batch.column(0)})
> Will throw this error
> In [12]: MyArrowUDTF(df.asTable()).show()
> 25/08/22 11:55:23 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 22)
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "pyarrow/table.pxi", line 4865, in pyarrow.lib.Table.from_batches
>   File "pyarrow/error.pxi", line 155, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 0 was different:
> x: int32
> vs
> x: int64
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:622)
> at 
> org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:130)
> at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:573)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
> at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
>  
> And this will block forever.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-53029) Support return type coercion for Arrow Python UDTFs

Reply via email to