[ https://issues.apache.org/jira/browse/SPARK-53029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allison Wang updated SPARK-53029:
---------------------------------
    Description:
Address this follow-up discussion:
[https://github.com/apache/spark/pull/51692#discussion_r2241413025]

Currently, this UDTF

@arrow_udtf(returnType="x int")
class MyArrowUDTF:
    def eval(self, batch: pa.RecordBatch):
        yield pa.table({"x": batch.column(0)})

throws this error:

In [12]: MyArrowUDTF(df.asTable()).show()
25/08/22 11:55:23 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 22)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "pyarrow/table.pxi", line 4865, in pyarrow.lib.Table.from_batches
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 0 was different:
x: int32
vs
x: int64
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:622)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:130)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:573)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)

After this error, the call blocks forever.

  was:
Address this follow up discussion
[https://github.com/apache/spark/pull/51692#discussion_r2241413025]

> Support return type coercion for Arrow Python UDTFs
> ---------------------------------------------------
>
>                 Key: SPARK-53029
>                 URL: https://issues.apache.org/jira/browse/SPARK-53029
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.1.0
>            Reporter: Allison Wang
>            Priority: Major
>
> Address this follow up discussion
> [https://github.com/apache/spark/pull/51692#discussion_r2241413025]
> Currently
> @arrow_udtf(returnType="x int")
> class MyArrowUDTF:
>     def eval(self, batch: pa.RecordBatch):
>         yield pa.table({"x": batch.column(0)})
> Will throw this error
> In [12]: MyArrowUDTF(df.asTable()).show()
> 25/08/22 11:55:23 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 22)
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "pyarrow/table.pxi", line 4865, in pyarrow.lib.Table.from_batches
>   File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Schema at index 0 was different:
> x: int32
> vs
> x: int64
>     at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:622)
>     at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:130)
>     at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:573)
>     at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>     at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:583)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:601)
>
> And this will block forever.