Joris Peeters created ARROW-11869:
-------------------------------------
Summary: [Java] Support re-emitting dictionaries in ArrowStreamWriter
Key: ARROW-11869
URL: https://issues.apache.org/jira/browse/ARROW-11869
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Joris Peeters
Assignee: Joris Peeters
The ArrowStreamWriter currently takes a DictionaryProvider at construction time
and emits the dictionaries in use only once.
However, the streaming format allows for the dictionaries to change between
record batches. It would be useful to support this mechanism. It can be worked
around in various ways (e.g. manually re-emitting DictionaryBatches between
calls to writeBatch), but this is cumbersome.
We'd somehow have to reconcile this with the abstract ArrowWriter parent and
the ArrowFileWriter sibling. In the latter, for example, this mechanism is not
supported.
An example solution (but perhaps we can do better) might be to add a virtual
`writeBatch(Provider provider)` method that throws UnsupportedOperationException
in ArrowFileWriter and re-emits the used dictionaries in ArrowStreamWriter.
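
To make the proposal above concrete, here is a rough sketch of the shape it could take. It is a toy model, not the Arrow API: `SketchWriter`, `SketchStreamWriter` and `SketchFileWriter` are hypothetical stand-ins for ArrowWriter, ArrowStreamWriter and ArrowFileWriter, a `Map<Long, List<String>>` (dictionary id to values) stands in for a DictionaryProvider, and emitted messages are simply recorded as strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Toy model only: none of these classes are part of Arrow.
// A Map<Long, List<String>> (dictionary id -> values) stands in for a DictionaryProvider.
abstract class SketchWriter {
    private final List<String> messages = new ArrayList<>();
    private final Map<Long, List<String>> initialDictionaries;
    private boolean schemaWritten = false;
    private boolean dictionariesWritten = false;
    private int batchCount = 0;

    SketchWriter(Map<Long, List<String>> initialDictionaries) {
        this.initialDictionaries = initialDictionaries;
    }

    // Current behaviour: the dictionaries given at construction time are
    // emitted once, ahead of the first record batch.
    public void writeBatch() {
        ensureSchema();
        if (!dictionariesWritten) {
            emitDictionaries(initialDictionaries);
        }
        messages.add("RecordBatch#" + batchCount++);
    }

    // Proposed overload: (re-)emit the dictionaries in `provider`, then write the batch.
    public abstract void writeBatch(Map<Long, List<String>> provider);

    protected void ensureSchema() {
        if (!schemaWritten) {
            messages.add("Schema");
            schemaWritten = true;
        }
    }

    protected void emitDictionaries(Map<Long, List<String>> dictionaries) {
        for (Map.Entry<Long, List<String>> e : dictionaries.entrySet()) {
            messages.add("DictionaryBatch(id=" + e.getKey() + ", values=" + e.getValue() + ")");
        }
        dictionariesWritten = true;
    }

    public List<String> getMessages() {
        return Collections.unmodifiableList(messages);
    }
}

// Stream variant: dictionaries may be replaced between record batches.
class SketchStreamWriter extends SketchWriter {
    SketchStreamWriter(Map<Long, List<String>> initialDictionaries) {
        super(initialDictionaries);
    }

    @Override
    public void writeBatch(Map<Long, List<String>> provider) {
        ensureSchema();
        emitDictionaries(provider); // replacements precede the batch that uses them
        writeBatch();
    }
}

// File variant: the mechanism is not supported, so the overload refuses.
class SketchFileWriter extends SketchWriter {
    SketchFileWriter(Map<Long, List<String>> initialDictionaries) {
        super(initialDictionaries);
    }

    @Override
    public void writeBatch(Map<Long, List<String>> provider) {
        throw new UnsupportedOperationException(
            "dictionary replacement is not supported in this sketch of the file writer");
    }
}
```

In this sketch, calling `writeBatch()` twice and then `writeBatch(newDictionaries)` on the stream writer records one schema message, the initial dictionary batch, two record batches, the replacement dictionary batch, and a third record batch. The same overload on the file writer throws. Whether the real hierarchy should expose the overload on the abstract parent at all is exactly the open design question raised above.
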
For now, this issue only concerns dictionary replacement, not dictionary
deltas.