[
https://issues.apache.org/jira/browse/ARROW-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858313#comment-15858313
]
Wes McKinney commented on ARROW-542:
------------------------------------
I raised this issue in the PR discussion for ARROW-366. The problem right now
is that the Field POJO in the Java library is tightly coupled to the JSON and
Flatbuffer metadata representation, which has a dictionary id.
When you read the Flatbuffer metadata to reconstruct a dictionary vector, the
process is:
* Field has dictionary id K
* You examine the dictionary batches and find the dictionary batch metadata
with id K
* Read that dictionary
* Construct the dictionary vector
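
As a rough sketch of the read path above: the classes below (FieldMeta, DictionaryBatch, indexById, decode) are hypothetical stand-ins written for illustration, not the Arrow Java API, and "constructing the dictionary vector" is simplified to resolving integer indices against string dictionary values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not the Arrow Java classes.
class DictionaryReadSketch {

    /** Field metadata reduced to a name and the dictionary id K it carries. */
    static final class FieldMeta {
        final String name;
        final long dictionaryId;

        FieldMeta(String name, long dictionaryId) {
            this.name = name;
            this.dictionaryId = dictionaryId;
        }
    }

    /** A dictionary batch reduced to its id and the dictionary values. */
    static final class DictionaryBatch {
        final long id;
        final List<String> values;

        DictionaryBatch(long id, List<String> values) {
            this.id = id;
            this.values = values;
        }
    }

    /** Index the dictionary batches by id so each field can look up its own. */
    static Map<Long, DictionaryBatch> indexById(List<DictionaryBatch> batches) {
        Map<Long, DictionaryBatch> byId = new HashMap<>();
        for (DictionaryBatch batch : batches) {
            byId.put(batch.id, batch);
        }
        return byId;
    }

    /** Find the batch with the field's id, then resolve the indices against it. */
    static List<String> decode(FieldMeta field, Map<Long, DictionaryBatch> byId, int[] indices) {
        DictionaryBatch batch = byId.get(field.dictionaryId);
        if (batch == null) {
            throw new IllegalStateException("No dictionary batch with id " + field.dictionaryId);
        }
        List<String> decoded = new ArrayList<>();
        for (int index : indices) {
            decoded.add(batch.values.get(index));
        }
        return decoded;
    }

    public static void main(String[] args) {
        Map<Long, DictionaryBatch> byId = indexById(Arrays.asList(
            new DictionaryBatch(3L, Arrays.asList("red", "green")),
            new DictionaryBatch(7L, Arrays.asList("x", "y", "z"))));
        FieldMeta field = new FieldMeta("color", 3L);
        System.out.println(decode(field, byId, new int[] {1, 0, 1}));
    }
}
```

The point of the sketch is that the field itself only needs the id; the dictionary values arrive separately and are joined by that id.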
When you are writing the metadata to IPC, the process is reversed:
* Extract all Dictionary objects from the record batch
* Assign a unique id to each dictionary
* Write the corresponding unique id into the field metadata of each dictionary
vector
* Write a DictionaryBatch with that dictionary and id for each dictionary
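
The write path could be sketched along these lines. Everything here (DictionaryColumn, WritePlan, plan) is a hypothetical name used for illustration rather than the Arrow Java API. The sketch also assumes that two columns sharing the same dictionary object should share one id, and that ids are handed out sequentially starting at 0; neither choice is dictated by the format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not the Arrow Java classes.
class DictionaryWriteSketch {

    /** A dictionary-encoded column: a field name plus the dictionary it refers to. */
    static final class DictionaryColumn {
        final String fieldName;
        final List<String> dictionary;

        DictionaryColumn(String fieldName, List<String> dictionary) {
            this.fieldName = fieldName;
            this.dictionary = dictionary;
        }
    }

    /** A dictionary batch reduced to its id and the dictionary values. */
    static final class DictionaryBatch {
        final long id;
        final List<String> values;

        DictionaryBatch(long id, List<String> values) {
            this.id = id;
            this.values = values;
        }
    }

    /** What would be written: the id per field, and one batch per dictionary. */
    static final class WritePlan {
        final Map<String, Long> fieldDictionaryIds = new LinkedHashMap<>();
        final List<DictionaryBatch> dictionaryBatches = new ArrayList<>();
    }

    static WritePlan plan(List<DictionaryColumn> columns) {
        WritePlan plan = new WritePlan();
        // Assumption: dictionaries are deduplicated by object identity.
        Map<List<String>, Long> assigned = new IdentityHashMap<>();
        long nextId = 0;
        for (DictionaryColumn column : columns) {
            Long id = assigned.get(column.dictionary);
            if (id == null) {
                // First time this dictionary is seen: assign an id and emit a batch.
                id = nextId++;
                assigned.put(column.dictionary, id);
                plan.dictionaryBatches.add(new DictionaryBatch(id, column.dictionary));
            }
            // Record the id that goes into this field's metadata.
            plan.fieldDictionaryIds.put(column.fieldName, id);
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("red", "green");
        List<String> sizes = Arrays.asList("S", "M", "L");
        WritePlan plan = plan(Arrays.asList(
            new DictionaryColumn("shirt_color", colors),
            new DictionaryColumn("hat_color", colors),
            new DictionaryColumn("size", sizes)));
        System.out.println(plan.fieldDictionaryIds);
        System.out.println(plan.dictionaryBatches.size() + " dictionary batches");
    }
}
```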
In the case of streaming, based on the number of observed dictionaries in the
metadata, we would expect to read that number of dictionary batches as the
first messages in the stream, followed by normal record batches (which will
contain the dictionary indices).
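
To make the expected stream ordering concrete, here is a minimal sketch. Message, StreamContents and read are hypothetical names, not the Arrow Java API. The sketch assumes the expected count is the number of distinct dictionary ids seen in the field metadata, and it chooses to reject a dictionary batch that shows up after the record batches have started.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration; these are not the Arrow Java classes.
class StreamOrderSketch {

    /** A stream message reduced to its kind and an opaque payload label. */
    static final class Message {
        final boolean isDictionaryBatch;
        final String payload;

        Message(boolean isDictionaryBatch, String payload) {
            this.isDictionaryBatch = isDictionaryBatch;
            this.payload = payload;
        }
    }

    /** The messages split into the leading dictionaries and the record batches. */
    static final class StreamContents {
        final List<Message> dictionaries = new ArrayList<>();
        final List<Message> recordBatches = new ArrayList<>();
    }

    static StreamContents read(long[] fieldDictionaryIds, Iterator<Message> messages) {
        // The number of dictionaries observed in the metadata tells us how many
        // dictionary batches to expect at the front of the stream.
        Set<Long> distinctIds = new HashSet<>();
        for (long id : fieldDictionaryIds) {
            distinctIds.add(id);
        }
        int expected = distinctIds.size();

        StreamContents contents = new StreamContents();
        for (int i = 0; i < expected; i++) {
            if (!messages.hasNext()) {
                throw new IllegalStateException("Stream ended before dictionary batch " + i);
            }
            Message message = messages.next();
            if (!message.isDictionaryBatch) {
                throw new IllegalStateException("Expected a dictionary batch at message " + i);
            }
            contents.dictionaries.add(message);
        }
        // Everything after that is treated as a normal record batch.
        while (messages.hasNext()) {
            Message message = messages.next();
            if (message.isDictionaryBatch) {
                // Assumption of this sketch: late dictionary batches are an error.
                throw new IllegalStateException("Unexpected dictionary batch after record batches");
            }
            contents.recordBatches.add(message);
        }
        return contents;
    }

    public static void main(String[] args) {
        List<Message> stream = Arrays.asList(
            new Message(true, "dictionary 0"),
            new Message(true, "dictionary 1"),
            new Message(false, "record batch A"),
            new Message(false, "record batch B"));
        StreamContents contents = read(new long[] {0L, 0L, 1L}, stream.iterator());
        System.out.println(contents.dictionaries.size() + " dictionaries, "
            + contents.recordBatches.size() + " record batches");
    }
}
```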
Let me know if this makes sense.
> [Java] Implement dictionaries in stream/file encoding
> -----------------------------------------------------
>
> Key: ARROW-542
> URL: https://issues.apache.org/jira/browse/ARROW-542
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java - Vectors
> Reporter: Emilio Lahr-Vivaz
> Assignee: Emilio Lahr-Vivaz
>