[ https://issues.apache.org/jira/browse/ARROW-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adam Hooper updated ARROW-6861:
-------------------------------
    Attachment: parquet-written-by-arrow-0-14-1.7z

> arrow-0.15.0 reading arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6861
>                 URL: https://issues.apache.org/jira/browse/ARROW-6861
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.0
>         Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
>            Reporter: Adam Hooper
>            Priority: Major
>         Attachments: fix-dict-builder-capacity.diff, parquet-written-by-arrow-0-14-1.7z
>
>
> I'll need to jump through hoops to upload the (seemingly-valid) Parquet file
> that triggers this bug. In the meantime, here's the error I get, reading the
> Parquet file with read_dictionary=true. I'll start with the stack trace:
>
> {{Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize}}
> {{#0 0x0000000000b9fffd in __cxa_throw ()}}
> {{#1 0x00000000004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow (this=0x555556612e50, num_values=67339, null_count=0, valid_bits=0x7f39a764b780 '\377' <repeats 200 times>..., valid_bits_offset=748544, builder=0x555556616330) at /src/apache-arrow-0.15.0/cpp/src/parquet/encoding.cc:886}}
> {{#2 0x000000000046d703 in parquet::internal::ByteArrayDictionaryRecordReader::ReadValuesSpaced (this=0x555556616260, values_to_read=67339, null_count=0) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1314}}
> {{#3 0x00000000004a13f8 in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecordData (this=0x555556616260, num_records=67339) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:1096}}
> {{#4 0x0000000000493876 in parquet::internal::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)6> >::ReadRecords (this=0x555556616260, num_records=815883) at /src/apache-arrow-0.15.0/cpp/src/parquet/column_reader.cc:875}}
> {{#5 0x0000000000413955 in parquet::arrow::LeafReader::NextBatch (this=0x555556615640, records_to_read=815883, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:413}}
> {{#6 0x0000000000412081 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x5555566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:218}}
> {{#7 0x00000000004121b0 in parquet::arrow::FileReaderImpl::ReadColumn (this=0x5555566067a0, i=7, out=0x7ffd4b5afab0) at /src/apache-arrow-0.15.0/cpp/src/parquet/arrow/reader.cc:223}}
> {{#8 0x0000000000405fbd in readParquet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()}}
>
> And now a report of my gdb adventures:
>
> In Arrow 0.15.0, when reading a particular dictionary column
> ({{read_dictionaries=true}}) with 815883 rows that was written by Arrow
> 0.14.1, {{arrow::Dictionary32Builder<arrow::BinaryType>::AppendIndices(...)}}
> is called twice (once with 493568 values, once with 254976 values); and then
> {{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't
> know why this column comes in three batches.) On the first
> {{AppendIndices()}} call, the buffer capacity is equal to the number of
> values. On the second call, that's no longer the case: the buffer grows using
> {{BufferBuilder::GrowByFactor}}, so its capacity is 987136.
>
> But there's a bug: the 987136-capacity buffer is in
> {{Dictionary32Builder::indices_builder_}}; so 987136 is stored in
> {{Dictionary32Builder::indices_builder_.capacity_}}.
> {{Dictionary32Builder::capacity_}} does not change when {{AppendIndices()}}
> is called. (Dictionary32Builder behaves like a proxy for its
> {{indices_builder_}}; but its {{capacity()}} method is not virtual, so things
> are messy.)
>
> So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values,
> via {{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But
> {{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its
> wrong, cached value) to {{length_ + num_values}} (815883). Since
> {{indices_builder_.capacity_}} is 987136, that's a downsize, which throws
> an exception.
>
> The only workaround I can find: use {{read_dictionaries=false}}.
>
> This affects Python, too.
>
> I've attached a patch that fixes the issue for my file. I don't know how to
> formulate a reduction, though, so I haven't contributed unit tests. I'm also
> not certain how FinishInternal is meant to work, so this definitely needs
> expert review. (FinishInternal was _definitely_ buggy before my patch; after
> my patch it _might_ be buggy but I don't know.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
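
For readers who want to follow the arithmetic, here is a minimal toy model of the capacity mismatch described in the report. It is a sketch under stated assumptions, not Arrow code: the classes `IndicesBuilder` and `DictionaryBuilder`, their methods, and the `sync_capacity` switch are all hypothetical stand-ins, and the "synced" variant is only one plausible way to keep the two capacities consistent, not necessarily what the attached patch does. The growth rule (at least doubling) is assumed so that it reproduces the 987136 figure quoted above.

```python
class IndicesBuilder:
    """Hypothetical stand-in for the inner indices builder."""

    def __init__(self):
        self.length = 0
        self.capacity = 0

    def resize(self, new_capacity):
        # Mirrors the error message in the report: shrinking is rejected.
        if new_capacity < self.capacity:
            raise ValueError("Resize cannot downsize")
        self.capacity = new_capacity

    def append(self, n):
        needed = self.length + n
        if needed > self.capacity:
            # Assumed growth rule: at least double the current capacity.
            self.resize(max(self.capacity * 2, needed))
        self.length += n


class DictionaryBuilder:
    """Hypothetical wrapper that caches its own copy of the capacity."""

    def __init__(self, sync_capacity):
        self.indices = IndicesBuilder()
        self.length = 0
        self.capacity = 0  # cached value; goes stale when sync_capacity is False
        self.sync_capacity = sync_capacity

    def append_indices(self, n):
        self.indices.append(n)
        self.length += n
        if self.sync_capacity:
            # Illustrative fix: refresh the cached capacity from the inner builder.
            self.capacity = self.indices.capacity

    def reserve(self, additional):
        needed = self.length + additional
        if needed > self.capacity:
            # With a stale cached capacity, this can ask the inner builder to shrink.
            self.indices.resize(needed)
            self.capacity = needed


def replay(sync_capacity):
    """Replay the three batches from the report: 493568, 254976, then 67339."""
    builder = DictionaryBuilder(sync_capacity)
    builder.append_indices(493568)
    builder.append_indices(254976)
    builder.reserve(67339)
    return builder
```

Calling `replay(False)` raises `ValueError("Resize cannot downsize")`, because the wrapper still believes its capacity is 0 and asks the inner builder to resize from 987136 down to 815883. Calling `replay(True)` completes without error, since 815883 already fits in the cached capacity of 987136.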