David Li created ARROW-8749:
-------------------------------

             Summary: [C++] IpcFormatWriter writes dictionary batches with wrong ID
                 Key: ARROW-8749
                 URL: https://issues.apache.org/jira/browse/ARROW-8749
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 0.17.0
            Reporter: David Li


IpcFormatWriter assigns dictionary IDs once when it writes the schema message. 
Then, when it writes dictionary batches, it re-collects the dictionaries from 
the given batch and assigns dictionary IDs a second time. So for example, if 
you have 5 dictionaries, the first dictionary is recorded in the schema with 
ID 0, but its dictionary batch is written with ID 5.
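
To make the double assignment concrete, here is a minimal toy sketch. It does not use Arrow's actual classes: ToyDictionaryMemo, AssignFresh, GetOrAssign and SecondPassIds are hypothetical names invented for illustration, and the "reuse" path is only one possible way to avoid the mismatch, not necessarily how it should be fixed in Arrow.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a dictionary ID registry (not Arrow's real class).
class ToyDictionaryMemo {
 public:
  // Always hands out a new ID, even if the field already has one.
  // Calling this in both passes models the behavior described above.
  int AssignFresh(const std::string& field) {
    int id = next_id_++;
    ids_[field] = id;
    return id;
  }

  // Returns the existing ID for a field, assigning one only if none exists.
  int GetOrAssign(const std::string& field) {
    auto it = ids_.find(field);
    if (it != ids_.end()) return it->second;
    return AssignFresh(field);
  }

 private:
  int next_id_ = 0;
  std::unordered_map<std::string, int> ids_;
};

// First pass models writing the schema; second pass models writing the
// dictionary batches. Returns the IDs used in the second pass.
std::vector<int> SecondPassIds(const std::vector<std::string>& fields,
                               bool reuse) {
  ToyDictionaryMemo memo;
  for (const auto& f : fields) memo.AssignFresh(f);  // schema pass: 0..n-1
  std::vector<int> ids;
  for (const auto& f : fields) {
    ids.push_back(reuse ? memo.GetOrAssign(f) : memo.AssignFresh(f));
  }
  return ids;
}
```

With five fields and reuse set to false, the second pass yields IDs 5 through 9, none of which match the IDs 0 through 4 recorded in the first pass. With reuse set to true, the second pass yields 0 through 4 again.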

The following test reproduces the problem. It fails with 
"'_error_or_value11.status()' failed with Key error: No record of dictionary 
type with id 9"
{code:cpp}
TEST_F(TestMetadata, DoPutDictionaries) {
  ASSERT_OK_AND_ASSIGN(auto sink, arrow::io::BufferOutputStream::Create());
  std::shared_ptr<Schema> schema = ExampleDictSchema();
  BatchVector expected_batches;
  ASSERT_OK(ExampleDictBatches(&expected_batches));
  ASSERT_OK_AND_ASSIGN(auto writer,
                       arrow::ipc::NewStreamWriter(sink.get(), schema));
  for (auto& batch : expected_batches) {
    ASSERT_OK(writer->WriteRecordBatch(*batch));
  }
  ASSERT_OK_AND_ASSIGN(auto buf, sink->Finish());
  arrow::io::BufferReader source(buf);
  ASSERT_OK_AND_ASSIGN(auto reader,
                       arrow::ipc::RecordBatchStreamReader::Open(&source));
  AssertSchemaEqual(schema, reader->schema());
  for (auto& batch : expected_batches) {
    ASSERT_OK_AND_ASSIGN(auto actual, reader->Next());
    AssertBatchesEqual(*actual, *batch);
  }
}
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
