Adam Hooper created ARROW-6861:

             Summary: With arrow-0.14.1-output Parquet dictionary column: 
Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize
                 Key: ARROW-6861
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.15.0
         Environment: debian:buster (in Docker, Linux 5.2.11-200.fc30.x86_64)
            Reporter: Adam Hooper
         Attachments: fix-dict-builder-capacity.diff

I'll need to jump through hoops to upload the (seemingly-valid) Parquet file 
that triggers this bug. In the meantime, here's the error I get, reading the 
Parquet file with read_dictionary=true. I'll start with the stack trace:

{{Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize}}

{{#0 0x0000000000b9fffd in __cxa_throw ()}}
 {{#1 0x00000000004ce7b5 in parquet::PlainByteArrayDecoder::DecodeArrow 
(this=0x555556612e50, num_values=67339, null_count=0, valid_bits=0x7f39a764b780 
'\377' <repeats 200 times>..., valid_bits_offset=748544,}}
 \{{ builder=0x555556616330) at 
 {{#2 0x000000000046d703 in 
(this=0x555556616260, values_to_read=67339, null_count=0)}}
 \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/}}
 {{#3 0x00000000004a13f8 in 
 >::ReadRecordData (this=0x555556616260, num_records=67339)}}
 \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/}}
 {{#4 0x0000000000493876 in 
 >::ReadRecords (this=0x555556616260, num_records=815883)}}
 \{{ at /src/apache-arrow-0.15.0/cpp/src/parquet/}}
 {{#5 0x0000000000413955 in parquet::arrow::LeafReader::NextBatch 
(this=0x555556615640, records_to_read=815883, out=0x7ffd4b5afab0) at 
 {{#6 0x0000000000412081 in parquet::arrow::FileReaderImpl::ReadColumn 
(this=0x5555566067a0, i=7, row_groups=..., out=0x7ffd4b5afab0) at 
 {{#7 0x00000000004121b0 in parquet::arrow::FileReaderImpl::ReadColumn 
(this=0x5555566067a0, i=7, out=0x7ffd4b5afab0) at 
 {{#8 0x0000000000405fbd in readParquet(std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const&) ()}}

And now a report of my gdb adventures:

In Arrow 0.15.0, when reading a particular dictionary column 
({{read_dictionaries=true}}) with 815883 rows that was written by Arrow 0.14.1, 
{{arrow::Dictionary32Builder<arrow::BinaryType>::AppendIndices(...)}} is called 
twice (once with 493568 values, once with 254976 values); and then 
{{PlainByteArrayDecoder::DecodeArrow()}} is called. (I'm a novice; I don't know 
why this column comes in three batches.) On the first {{AppendIndices()}} call, 
the buffer capacity is equal to the number of values. On the second call, that's 
no longer the case: the buffer grows using {{BufferBuilder::GrowByFactor}}, so 
its capacity is 987136.
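
The numbers above can be checked with a few lines of arithmetic. This is only a toy sketch in Python, not Arrow code: the {{grow_by_factor}} helper is hypothetical, and its "at least double" rule is an assumption chosen because it matches the reported capacity (987136 is exactly twice 493568).

```python
def grow_by_factor(current_capacity, needed_capacity):
    """Hypothetical growth rule: at least double, or jump to what is needed."""
    return max(needed_capacity, current_capacity * 2)


# Batch sizes taken from the report.
first_batch, second_batch, third_batch = 493568, 254976, 67339

# First AppendIndices() call: capacity equals the number of values.
capacity = first_batch
length = first_batch

# Second call: room is needed for both batches, so the buffer grows.
needed = length + second_batch
capacity = grow_by_factor(capacity, needed)
length = needed

# Total rows once the third batch (the DecodeArrow() call) is added.
total_rows = length + third_batch

print(capacity, length, total_rows)
```

The three batches add up to the 815883 rows in the column, and that total is smaller than the capacity the buffer already has after the second call.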

But there's a bug: the 987136-capacity buffer is in 
{{Dictionary32Builder::indices_builder_}}, so 987136 is stored in that inner 
builder's {{capacity_}}. {{Dictionary32Builder::capacity_}} does not change when 
{{AppendIndices()}} is called. (Dictionary32Builder behaves like a proxy for its 
{{indices_builder_}}; but its {{capacity()}} method is not virtual, so things 
are messy.)

So {{builder.capacity_}} is 0. Then comes the final batch of 67339 values, via 
{{DecodeArrow()}}. It calls {{builder->Reserve(num_values)}}. But 
{{builder->Reserve(num_values)}} tries to increase the capacity from 0 (its 
wrong, cached value) to {{length_ + num_values}} (815883). Since 
{{indices_builder_->capacity_}} is 987136, that's a downsize – which throws an 
exception.
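
To make the sequence easier to follow, here is a toy Python model of the stale cached capacity. It is a sketch under stated assumptions, not Arrow's implementation: the class and method names are hypothetical stand-ins, the doubling rule is assumed, and {{SyncedToyDictionaryBuilder}} shows one possible direction for a fix, which is not necessarily what the attached patch does.

```python
class ToyIndicesBuilder:
    """Hypothetical stand-in for the inner indices builder; owns the buffer."""

    def __init__(self):
        self.length = 0
        self.capacity = 0

    def append(self, n):
        needed = self.length + n
        if needed > self.capacity:
            # Assumed growth rule: at least double the current capacity.
            self.capacity = max(needed, self.capacity * 2)
        self.length = needed

    def resize(self, new_capacity):
        if new_capacity < self.capacity:
            raise ValueError("Resize cannot downsize")
        self.capacity = new_capacity


class ToyDictionaryBuilder:
    """Hypothetical outer builder that caches its own capacity."""

    def __init__(self):
        self.indices = ToyIndicesBuilder()
        self.length = 0
        self.capacity = 0  # cached value, never refreshed by append_indices

    def append_indices(self, n):
        self.indices.append(n)
        self.length += n  # length is tracked, capacity is not

    def reserve(self, additional):
        needed = self.length + additional
        if needed > self.capacity:  # compares against the stale cached value
            self.indices.resize(needed)
            self.capacity = needed


class SyncedToyDictionaryBuilder(ToyDictionaryBuilder):
    """One possible fix direction: refresh the cached capacity after appends."""

    def append_indices(self, n):
        super().append_indices(n)
        self.capacity = self.indices.capacity


def replay(builder):
    """Replay the three batches from the report; return the error text or None."""
    builder.append_indices(493568)
    builder.append_indices(254976)
    try:
        builder.reserve(67339)
    except ValueError as err:
        return str(err)
    return None


print(replay(ToyDictionaryBuilder()))
print(replay(SyncedToyDictionaryBuilder()))
```

In this model the first builder asks the inner builder to resize to 815883 while it already holds 987136, so the resize is rejected. The synced variant sees that enough capacity is already available and skips the resize.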

The only workaround I can find: use {{read_dictionaries=false}}.

This affects Python, too.

I've attached a patch that fixes the issue for my file. I don't know how to 
formulate a reduction, though, so I haven't contributed unit tests. I'm also 
not certain how FinishInternal is meant to work, so this definitely needs 
expert review. (FinishInternal was _definitely_ buggy before my patch; after my 
patch it _might_ be buggy but I don't know.)

