[ 
https://issues.apache.org/jira/browse/ARROW-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-9660.
-----------------------------------
    Resolution: Fixed

Issue resolved by pull request 7992
[https://github.com/apache/arrow/pull/7992]

> [C++] IPC - dictionaries in maps
> --------------------------------
>
>                 Key: ARROW-9660
>                 URL: https://issues.apache.org/jira/browse/ARROW-9660
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 1.0.0
>            Reporter: Pierre Belzile
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I created the following record batch which has a single column with a type of 
> map<dict, string> where dict is defined as: dict<int8,string>:
>  
> {code:cpp}
> arrow::MapBuilder map_builder(arrow::default_memory_pool(),
>     std::make_shared<arrow::StringDictionaryBuilder>(),
>     std::make_shared<arrow::StringBuilder>());
> auto key_builder = 
>     dynamic_cast<arrow::StringDictionaryBuilder *>(map_builder.key_builder());
> auto item_builder = 
>     dynamic_cast<arrow::StringBuilder *>(map_builder.item_builder());
> // Add a first row with k<i>=v<i> for i 0..14;
> ASSERT_OK(map_builder.Append());
> for (int i = 0; i < 15; ++i) {
>   ASSERT_OK(key_builder->Append("k" + std::to_string(i)));
>   ASSERT_OK(item_builder->Append("v" + std::to_string(i)));
> }
> // Add a second row with k<i>=w<i> for i 0..14;
> ASSERT_OK(map_builder.Append());
> for (int i = 0; i < 15; ++i) {
>   ASSERT_OK(key_builder->Append("k" + std::to_string(i)));
>   ASSERT_OK(item_builder->Append("w" + std::to_string(i)));
> }
> std::shared_ptr<arrow::Array> array;
> ASSERT_OK(map_builder.Finish(&array));
> std::shared_ptr<arrow::Schema> schema = 
>     arrow::schema({arrow::field("s", array->type())});
> std::shared_ptr<arrow::RecordBatch> batch = 
>     arrow::RecordBatch::Make(schema, array->length(), {array});
> {code}
> When one attempts to send this in a round trip IPC:
>  # On IpcFormatWriter::Start(): The memo records one entry for field_to_id 
> and id_to_type_ where the dict id = 0.
>  # On IpcFormatWriter::CollectDictionaries: The memo records a new entry for 
> field_to_id and id_to_type with id=1 and also records in id_to_dictionary_. 
> At this point we have 2 entries with the entry id=0 having no associated dict.
>  # On IpcFormatWriter::WriteDictionaries: It writes the dict with entry = 1
> When reading:
>  # GetSchema eventually gets to the nested dictionary in FieldFromFlatBuffer
>  # The recovered dict id is 0.
>  # This adds to the memo the field_to_id and id_to_type with id = 0
>  # My round trip code calls "ReadAll".
>  # RecordBatchStreamReaderImpl::ReadNext attempts to load the initial dicts
>  # It recovers id = 1
>  # The process aborts because id = 1 is not in the memo: 
> dictionary_memo->GetDictionaryType(id, &value_type)
> A similar example with a dict inside a "struct" worked fine and only used 
> dict id = 0. So it looks like something goes wrong when gathering the schema 
> for the map. Unless I did not construct the map correctly?
>  
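
The write and read sequence described in the report can be sketched with a toy model. This is not Arrow's DictionaryMemo or its IPC code: ToyMemo, WriterResult, SimulateBuggyWriter and SimulateReader are hypothetical names made up for illustration, and the model only replays the observed id assignments (0 in the schema, 1 in the dictionary batch). It makes no claim about the root cause or about the fix in pull request 7992.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy stand-in for the dictionary memo described in the report.
// NOT Arrow's DictionaryMemo; names and behavior are simplified assumptions.
struct ToyMemo {
  std::map<int, std::string> id_to_type;                     // dict id -> value type name
  std::map<int, std::vector<std::string>> id_to_dictionary;  // dict id -> dictionary values
  int next_id = 0;

  // Registers a dictionary-encoded field and returns the newly assigned id.
  int AddField(const std::string& value_type) {
    int id = next_id++;
    id_to_type[id] = value_type;
    return id;
  }

  bool HasType(int id) const { return id_to_type.count(id) > 0; }
};

// The two dictionary ids that end up on the wire.
struct WriterResult {
  int schema_dict_id;   // id serialized with the schema
  int payload_dict_id;  // id serialized with the dictionary batch
};

// Replays the reported writer behavior: the map's key field is registered
// once when the schema is written and a second time when dictionaries are
// collected, so only the second id has a dictionary attached.
WriterResult SimulateBuggyWriter(ToyMemo& memo) {
  int schema_id = memo.AddField("string");   // "Start": id 0, no dictionary
  int payload_id = memo.AddField("string");  // "CollectDictionaries": id 1
  memo.id_to_dictionary[payload_id] = {"k0", "k1"};
  return {schema_id, payload_id};
}

// Replays the reader: the schema registers only schema_dict_id, then the
// dictionary batch is looked up by payload_dict_id. Returns false when the
// lookup would fail, which is where the reported process aborts.
bool SimulateReader(const WriterResult& wire) {
  ToyMemo memo;
  memo.id_to_type[wire.schema_dict_id] = "string";
  return memo.HasType(wire.payload_dict_id);
}
```

In this model, SimulateReader returns false for the ids produced by SimulateBuggyWriter and true when both ids are 0, which matches the "struct" case that the report says worked.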



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
