Liangcai li created ARROW-17912:
-----------------------------------

             Summary: Arrow C++ IPC fails to send an empty table, but Arrow 
Java can do it.
                 Key: ARROW-17912
                 URL: https://issues.apache.org/jira/browse/ARROW-17912
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Liangcai li


My current work is about Pyspark Cogroup Pandas UDF. And two processes are 
involved, the JVM one (sender) and the Python one (receiver).

[Spark is using the Arrow Java 
`ArrowStreamWriter`|https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/python/CoGroupedArrowPythonRunner.scala#L99]
 to serialize Arrow tables being sent from the JVM process to the Python 
process, and ArrowStreamWriter can handle empty tables correctly.

[While cuDF is using the Arrow C++ RecordBatchWriter 
|https://github.com/rapidsai/cudf/blob/branch-22.10/java/src/main/native/src/TableJni.cpp#L254]to
 do the same serialization, but it leads to an error as below on the Python 
side, where [the Pyspark is calling Pyarrow 
Table.from_batches|https://github.com/apache/spark/blob/branch-3.3/python/pyspark/sql/pandas/serializers.py#L366]
 to deserialize the arrow stream.

```

'Must pass schema, or at least one RecordBatch'

```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to