[
https://issues.apache.org/jira/browse/ARROW-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Liangcai li updated ARROW-17912:
--------------------------------
Description:
My current work is about Pyspark Cogroup Pandas UDF. And two processes are
involved, the JVM one (sender) and the Python one (receiver).
[Spark is using the Arrow Java
`ArrowStreamWriter`|https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/python/CoGroupedArrowPythonRunner.scala#L99]
to serialize Arrow tables being sent from the JVM process to the Python
process, and ArrowStreamWriter can handle empty tables correctly.
[While cuDF is using the Arrow C++ RecordBatchWriter
|https://github.com/rapidsai/cudf/blob/branch-22.10/java/src/main/native/src/TableJni.cpp#L254]to
do the same serialization, but it leads to an error as below on the Python
side, where [the Pyspark is calling Pyarrow
*Table.from_batches*|https://github.com/apache/spark/blob/branch-3.3/python/pyspark/sql/pandas/serializers.py#L366]
to deserialize the arrow stream.
```
{color:#de350b}_'Must pass schema, or at least one RecordBatch'_{color}
```
was:
My current work is about Pyspark Cogroup Pandas UDF. And two processes are
involved, the JVM one (sender) and the Python one (receiver).
[Spark is using the Arrow Java
`ArrowStreamWriter`|https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/python/CoGroupedArrowPythonRunner.scala#L99]
to serialize Arrow tables being sent from the JVM process to the Python
process, and ArrowStreamWriter can handle empty tables correctly.
[While cuDF is using the Arrow C++ RecordBatchWriter
|https://github.com/rapidsai/cudf/blob/branch-22.10/java/src/main/native/src/TableJni.cpp#L254]to
do the same serialization, but it leads to an error as below on the Python
side, where [the Pyspark is calling Pyarrow
Table.from_batches|https://github.com/apache/spark/blob/branch-3.3/python/pyspark/sql/pandas/serializers.py#L366]
to deserialize the arrow stream.
```
{color:#de350b}_'Must pass schema, or at least one RecordBatch'_{color}
```
> Arrow C++ IPC fails to send an empty table, but Arrow Java can do it.
> ---------------------------------------------------------------------
>
> Key: ARROW-17912
> URL: https://issues.apache.org/jira/browse/ARROW-17912
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Liangcai li
> Priority: Major
>
> My current work is about Pyspark Cogroup Pandas UDF. And two processes are
> involved, the JVM one (sender) and the Python one (receiver).
> [Spark is using the Arrow Java
> `ArrowStreamWriter`|https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/python/CoGroupedArrowPythonRunner.scala#L99]
> to serialize Arrow tables being sent from the JVM process to the Python
> process, and ArrowStreamWriter can handle empty tables correctly.
> [While cuDF is using the Arrow C++ RecordBatchWriter
> |https://github.com/rapidsai/cudf/blob/branch-22.10/java/src/main/native/src/TableJni.cpp#L254]to
> do the same serialization, but it leads to an error as below on the Python
> side, where [the Pyspark is calling Pyarrow
> *Table.from_batches*|https://github.com/apache/spark/blob/branch-3.3/python/pyspark/sql/pandas/serializers.py#L366]
> to deserialize the arrow stream.
> ```
> {color:#de350b}_'Must pass schema, or at least one RecordBatch'_{color}
> ```
--
This message was sent by Atlassian Jira
(v8.20.10#820010)