[
https://issues.apache.org/jira/browse/ARROW-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612091#comment-17612091
]
Liangcai li edited comment on ARROW-17912 at 10/2/22 12:23 PM:
---------------------------------------------------------------
After some investigation, according to [the code
here|https://github.com/apache/arrow/blob/release-8.0.0/cpp/src/arrow/table.cc#L621],
it looks like the *_TableBatchReader_* returns null, not an empty batch, when
given an empty table. So the [RecordBatchWriter skips the empty table
and writes no batch into the
stream|https://github.com/apache/arrow/blob/release-8.0.0/cpp/src/arrow/ipc/writer.cc#L964].
But [Pyarrow requires at least one
batch|https://github.com/apache/arrow/blob/release-7.0.0/python/pyarrow/table.pxi#L1936]
when the schema is not specified.
Maybe we can do something like the following to cover the empty table case, by
adding a new boolean variable *_reach_end_*.
```
 Status TableBatchReader::ReadNext(std::shared_ptr<RecordBatch>* out) {
-  if (absolute_row_position_ == table_.num_rows()) {
+  if (reach_end) {
     *out = nullptr;
     return Status::OK();
@@ -666,6 +668,10 @@ Status TableBatchReader::ReadNext(std::shared_ptr<RecordBatch>* out) {
   absolute_row_position_ += chunksize;
   *out = RecordBatch::Make(table_.schema(), chunksize, std::move(batch_data));
+  if (absolute_row_position_ == table_.num_rows()) {
+    reach_end = true;
+  }
+
   return Status::OK();
 }
```
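To illustrate the intent of the change, here is a small standalone sketch that does not use Arrow at all. `ToyBatch`, `ToyTableBatchReader`, and `DrainBatchSizes` are hypothetical names made up for this example, and a batch is reduced to its row count. It only models the control flow: with the position check, an empty table ends the stream immediately; with the `reach_end` flag, the reader emits one zero-row batch first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Toy stand-in for a record batch: only the row count matters here.
struct ToyBatch {
  int64_t num_rows;
};

// Hypothetical, simplified model of a table batch reader. NOT Arrow's class.
class ToyTableBatchReader {
 public:
  ToyTableBatchReader(int64_t table_rows, int64_t max_chunksize,
                      bool use_reach_end_flag)
      : table_rows_(table_rows),
        max_chunksize_(max_chunksize),
        use_reach_end_flag_(use_reach_end_flag) {
    // A non-positive chunk size would never advance the position.
    assert(max_chunksize > 0);
  }

  // Returns std::nullopt at end of stream, mirroring `*out = nullptr`.
  std::optional<ToyBatch> ReadNext() {
    // Old behaviour: stop when the position equals the row count, which is
    // already true for an empty table before anything has been emitted.
    // Proposed behaviour: stop only after a batch has reached the end.
    const bool at_end =
        use_reach_end_flag_ ? reach_end_ : (position_ == table_rows_);
    if (at_end) {
      return std::nullopt;
    }

    const int64_t chunksize =
        std::min(table_rows_ - position_, max_chunksize_);
    position_ += chunksize;
    if (position_ == table_rows_) {
      reach_end_ = true;
    }
    return ToyBatch{chunksize};
  }

 private:
  int64_t table_rows_;
  int64_t max_chunksize_;
  bool use_reach_end_flag_;
  int64_t position_ = 0;
  bool reach_end_ = false;
};

// Mimics a writer loop: drain the reader and record the size of every batch
// that would have been written to the stream.
std::vector<int64_t> DrainBatchSizes(ToyTableBatchReader reader) {
  std::vector<int64_t> sizes;
  while (auto batch = reader.ReadNext()) {
    sizes.push_back(batch->num_rows);
  }
  return sizes;
}
```

In this model, `DrainBatchSizes(ToyTableBatchReader(0, 4, false))` yields no batches, while the same call with `true` yields a single batch of 0 rows. Non-empty tables produce the same sequence of batches in both modes.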
> Arrow C++ IPC fails to send an empty table, but Arrow Java can do it.
> ---------------------------------------------------------------------
>
> Key: ARROW-17912
> URL: https://issues.apache.org/jira/browse/ARROW-17912
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Liangcai li
> Priority: Major
>
> My current work is on the Pyspark Cogroup Pandas UDF, which involves two
> processes: the JVM one (sender) and the Python one (receiver).
> [Spark uses the Arrow Java
> `ArrowStreamWriter`|https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/python/CoGroupedArrowPythonRunner.scala#L99]
> to serialize Arrow tables sent from the JVM process to the Python
> process, and ArrowStreamWriter handles empty tables correctly.
> [cuDF uses the Arrow C++
> RecordBatchWriter|https://github.com/rapidsai/cudf/blob/branch-22.10/java/src/main/native/src/TableJni.cpp#L254]
> to do the same serialization, but this leads to the error below on the Python
> side, where [Pyspark calls Pyarrow
> *Table.from_batches*|https://github.com/apache/spark/blob/branch-3.3/python/pyspark/sql/pandas/serializers.py#L366]
> to deserialize the Arrow stream.
> ```
> E   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 297, in load_stream
> E     [self.arrow_to_pandas(c) for c in pa.Table.from_batches(batch2).itercolumns()]
> E   File "pyarrow/table.pxi", line 1609, in pyarrow.lib.Table.from_batches
> E ValueError: Must pass schema, or at least one RecordBatch
> ```