[ 
https://issues.apache.org/jira/browse/ARROW-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858313#comment-15858313
 ] 

Wes McKinney commented on ARROW-542:
------------------------------------

I raised this issue in the PR discussion for ARROW-366. The problem right now 
is that the Field POJO in the Java library is tightly coupled to the JSON and 
Flatbuffer metadata representation, which has a dictionary id. 

When you read the Flatbuffer metadata to reconstruct a dictionary vector, the 
process is:

* Field has dictionary id K
* You examine the dictionary batches and find the dictionary batch metadata 
with id K
* Read that dictionary
* Construct the dictionary vector
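
To make the read path concrete, here is a minimal sketch of those four steps. The class names (DictBatch, EncodedField, DictionaryReadSketch) are hypothetical stand-ins and are not the Arrow Java API; the dictionary is assumed to hold strings purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a dictionary batch: an id plus the dictionary values.
final class DictBatch {
    final long id;
    final List<String> values;

    DictBatch(long id, List<String> values) {
        this.id = id;
        this.values = values;
    }
}

// Hypothetical stand-in for a dictionary-encoded field: it carries only the
// dictionary id K and the integer indices into that dictionary.
final class EncodedField {
    final String name;
    final long dictionaryId;
    final int[] indices;

    EncodedField(String name, long dictionaryId, int[] indices) {
        this.name = name;
        this.dictionaryId = dictionaryId;
        this.indices = indices;
    }
}

final class DictionaryReadSketch {
    // Index the dictionary batches by id so a field can look up "its" batch.
    static Map<Long, List<String>> indexById(List<DictBatch> batches) {
        Map<Long, List<String>> byId = new HashMap<>();
        for (DictBatch batch : batches) {
            byId.put(batch.id, batch.values);
        }
        return byId;
    }

    // Find the dictionary with the field's id, then materialize the values
    // that the indices point at (the "dictionary vector" in decoded form).
    static List<String> decode(EncodedField field, Map<Long, List<String>> byId) {
        List<String> dictionary = byId.get(field.dictionaryId);
        if (dictionary == null) {
            throw new IllegalStateException(
                "no dictionary batch with id " + field.dictionaryId);
        }
        List<String> decoded = new ArrayList<>(field.indices.length);
        for (int index : field.indices) {
            decoded.add(dictionary.get(index));
        }
        return decoded;
    }
}
```

The point of the sketch is that the field itself never owns the dictionary; it only knows the id, and the lookup happens against the dictionary batches.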

When you are writing the metadata (that is, writing to IPC), the process is reversed:

* Extract all Dictionary objects from the record batch
* Assign a unique id to each dictionary
* Write the corresponding unique id into the field metadata for each dictionary 
vector
* Write a DictionaryBatch with that dictionary and id for each dictionary
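
The write path can be sketched the same way. Everything here is hypothetical (DictColumn, WritePlan, DictionaryWriteSketch are not Arrow classes), and the sketch assumes that two columns share a dictionary exactly when they reference the same dictionary object, which is one possible policy and not something the library is known to do.

```java
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a dictionary-encoded column in a record batch.
final class DictColumn {
    final String name;
    final List<String> dictionary; // may be shared between columns
    final int[] indices;

    DictColumn(String name, List<String> dictionary, int[] indices) {
        this.name = name;
        this.dictionary = dictionary;
        this.indices = indices;
    }
}

// What would be written: an id per field, and one dictionary batch per id.
final class WritePlan {
    final Map<String, Long> fieldDictionaryIds = new LinkedHashMap<>();
    final Map<Long, List<String>> dictionaryBatches = new LinkedHashMap<>();
}

final class DictionaryWriteSketch {
    static WritePlan plan(List<DictColumn> columns) {
        WritePlan plan = new WritePlan();
        // Assumption: dictionaries are identified by object identity, so a
        // dictionary shared by several columns gets a single id.
        Map<List<String>, Long> assigned = new IdentityHashMap<>();
        long nextId = 0;
        for (DictColumn column : columns) {
            Long id = assigned.get(column.dictionary);
            if (id == null) {
                // First time this dictionary is seen: assign a fresh unique id
                // and schedule a dictionary batch for it.
                id = nextId++;
                assigned.put(column.dictionary, id);
                plan.dictionaryBatches.put(id, column.dictionary);
            }
            // Record the id in the metadata of the field that uses it.
            plan.fieldDictionaryIds.put(column.name, id);
        }
        return plan;
    }
}
```

With this shape the ids exist only in the serialized metadata; the in-memory columns never need to carry one.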

In the case of streaming, based on the number of dictionaries observed in the 
metadata, we would expect to read that number of dictionary batches as the 
first messages in the stream, followed by normal record batches (which will 
contain the dictionary indices).
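
As a rough illustration of that message ordering, a stream reader could consume the leading dictionary batches before touching any record batch. StreamMessage and StreamOrderSketch are hypothetical names, not the Arrow stream reader, and the strict checks (no unknown or duplicate ids) are assumptions of this sketch.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Hypothetical stand-in for a message in the stream.
final class StreamMessage {
    enum Kind { DICTIONARY_BATCH, RECORD_BATCH }

    final Kind kind;
    final long dictionaryId; // meaningful only for DICTIONARY_BATCH

    private StreamMessage(Kind kind, long dictionaryId) {
        this.kind = kind;
        this.dictionaryId = dictionaryId;
    }

    static StreamMessage dictionary(long id) {
        return new StreamMessage(Kind.DICTIONARY_BATCH, id);
    }

    static StreamMessage record() {
        return new StreamMessage(Kind.RECORD_BATCH, -1L);
    }
}

final class StreamOrderSketch {
    // Reads exactly as many dictionary batches as there are dictionary ids in
    // the schema metadata. On return, the iterator is positioned at the first
    // record batch.
    static Set<Long> readDictionaryPrefix(Set<Long> expectedIds,
                                          Iterator<StreamMessage> messages) {
        Set<Long> seen = new HashSet<>();
        while (seen.size() < expectedIds.size()) {
            if (!messages.hasNext()) {
                throw new IllegalStateException(
                    "stream ended before all dictionaries were read");
            }
            StreamMessage message = messages.next();
            if (message.kind != StreamMessage.Kind.DICTIONARY_BATCH) {
                throw new IllegalStateException(
                    "expected a dictionary batch before any record batch");
            }
            if (!expectedIds.contains(message.dictionaryId)) {
                throw new IllegalStateException(
                    "unexpected dictionary id " + message.dictionaryId);
            }
            if (!seen.add(message.dictionaryId)) {
                throw new IllegalStateException(
                    "duplicate dictionary id " + message.dictionaryId);
            }
        }
        return seen;
    }
}
```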

Let me know if this makes sense.

> [Java] Implement dictionaries in stream/file encoding
> -----------------------------------------------------
>
>                 Key: ARROW-542
>                 URL: https://issues.apache.org/jira/browse/ARROW-542
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java - Vectors
>            Reporter: Emilio Lahr-Vivaz
>            Assignee: Emilio Lahr-Vivaz
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
