rustyconover opened a new issue, #49258:
URL: https://github.com/apache/arrow/issues/49258
### Describe the bug, including details regarding any error messages, version, and platform.
Hi Arrow Friends,
`RecordBatch.serialize()` on a batch with dictionary-encoded columns produces
a single IPC record batch message containing only the indices. The dictionary
values message is not included. Attempting to read the result back with
`ipc.read_record_batch()` fails with
`ArrowKeyError: Dictionary field not found.`
A full IPC stream via `ipc.new_stream` correctly writes schema, dictionary,
and record batch messages and round-trips fine.
This is surprising because `serialize()` succeeds without error, yet it
produces bytes that cannot be deserialized.
I think we have some gaps in the PyArrow IPC API regarding `DictionaryMemo`
and the ability to interpret `Dictionary` IPC messages.
To see the issue:
```python
import pyarrow as pa
from pyarrow import ipc
arr = pa.array(
    ["apple", "banana", "apple", "cherry", "banana"]
).dictionary_encode()
batch = pa.record_batch([arr], names=["fruit"])
# serialize() produces only a record batch message — no dictionary message
raw = batch.serialize().to_pybytes()
reader = pa.BufferReader(pa.py_buffer(raw))
while (msg := ipc.read_message(reader)) is not None:
    print(msg.type)
# Output: record batch
# Round-trip fails
ipc.read_record_batch(pa.py_buffer(raw), batch.schema)
# ArrowKeyError: Dictionary field not found
```
For comparison, a full IPC stream includes the dictionary:
```python
sink = pa.BufferOutputStream()
writer = ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()
stream = sink.getvalue().to_pybytes()
reader = pa.BufferReader(pa.py_buffer(stream))
while (msg := ipc.read_message(reader)) is not None:
    print(msg.type)
# Output: schema, dictionary, record batch
```
This bug is a motivating and organizing one. I'm going to be creating a
series of PRs that expand the PyArrow API around dictionary messages so that
it is easier for users to serialize IPC RecordBatches with dictionaries.
Rusty
### Component(s)
Python