proteetpaul-desco opened a new issue, #43334:
URL: https://github.com/apache/arrow/issues/43334
### Describe the enhancement requested
We have a use case that involves transferring filtered Arrow arrays, which can be chunked, from a C++ server to a Python client. The downstream flow in the Python client creates NumPy arrays from the Arrow arrays, which requires flattening the chunks of an Arrow chunked array. This flattening, whether done on the client or the server side, incurs the time and memory overhead of an additional copy.
As a solution, we propose enhancing the `RecordBatchWriter` class to
optionally concatenate Arrow buffers during network transfer. This enhancement
would allow the sending process to send multiple buffers sequentially over the
network socket, while the receiving process would interpret these buffers as a
single contiguous unit.
Please let us know if this idea sounds agreeable. We are willing to implement the solution ourselves.
**Example code snippet:**
<ins>Server (sender): [Replaced C++ code with Python for simplicity]</ins>
```python
>>> arr = pyarrow.array([1, 2, 3])
>>> chunked_arr = pyarrow.chunked_array([arr, arr])
>>> tbl = pyarrow.table([chunked_arr], names=['a'])
>>> options = pyarrow.ipc.IpcWriteOptions()
>>> # Proposed IPC write option to enable unification of array chunks on the wire
>>> options.unify_array_chunks = True
>>> writer = pyarrow.RecordBatchStreamWriter(<stream>, tbl.schema, options=options)
>>> writer.write_table(tbl)
```
<ins>Client (receiver):</ins>
```python
>>> reader = pyarrow.RecordBatchStreamReader(<stream>)
>>> tbl = reader.read_all()
>>> tbl.columns[0]  # Should have a single chunk
<pyarrow.lib.ChunkedArray object>
[
  [
    1,
    2,
    3,
    1,
    2,
    3
  ]
]
```
### Component(s)
C++