[
https://issues.apache.org/jira/browse/ARROW-18031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617042#comment-17617042
]
Antoine Pitrou edited comment on ARROW-18031 at 10/13/22 1:12 PM:
------------------------------------------------------------------
The concrete issue here seems to be that {{BitReader::GetAligned<bool>}}
doesn't take the expected bit width into account, and blindly copies the
encoded bytes into the {{bool*}} output buffer.
I actually don't understand how the tests pass at all given that this seems
clearly broken, unless other Parquet writers happen to make a similar mistake?
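To illustrate the failure mode: a {{bool}} object may only hold 0 or 1, so
copying an arbitrary encoded byte (say 0xFF) into a {{bool*}} and then loading
it is the kind of invalid-value load that UBSan reports. Below is a minimal
sketch of a safer pattern, assuming the run value is stored little-endian in
{{num_bytes}} bytes. {{DecodeAlignedBool}} is a hypothetical helper for
illustration only; it is not Arrow's API and not necessarily how the fix will
look:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Arrow): decode one byte-aligned RLE run
// value as a bool without ever storing a raw byte into a bool object.
// Assumption: the value occupies `num_bytes` little-endian bytes at `src`,
// and only the low `bit_width` bits are significant.
inline bool DecodeAlignedBool(const uint8_t* src, int num_bytes, int bit_width) {
  assert(num_bytes >= 0 && num_bytes <= 8);
  assert(bit_width >= 0 && bit_width <= 64);

  // Assemble the bytes into an integer first; any bit pattern is a valid
  // uint64_t, unlike bool.
  uint64_t raw = 0;
  for (int i = 0; i < num_bytes; ++i) {
    raw |= static_cast<uint64_t>(src[i]) << (8 * i);
  }

  // Keep only the bits that the declared bit width allows.
  const uint64_t mask =
      bit_width >= 64 ? ~uint64_t{0} : ((uint64_t{1} << bit_width) - 1);

  // The comparison yields a well-formed bool (exactly true or false).
  return (raw & mask) != 0;
}
```

With this sketch, the byte 0x02 at bit width 1 decodes to false, since only
the lowest bit is significant, whereas a blind byte copy would leave an
invalid value in the {{bool}}.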
> [C++][Parquet] Undefined behavior in boolean RLE decoder
> --------------------------------------------------------
>
> Key: ARROW-18031
> URL: https://issues.apache.org/jira/browse/ARROW-18031
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet
> Reporter: Antoine Pitrou
> Priority: Critical
> Fix For: 10.0.0
>
>
> A fuzzing run found this undefined behavior, which hints that the RLE boolean
> decoder implementation is wrong:
> {code}
> #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> #1 0x00007ffff7a45859 in __GI_abort () at abort.c:79
> #2 0x000055555beafa07 in __sanitizer::Abort() ()
> #3 0x000055555bead8a1 in __sanitizer::Die() ()
> #4 0x000055555bec15cc in __ubsan::ScopedReport::~ScopedReport() ()
> #5 0x000055555bec437b in handleLoadInvalidValue(__ubsan::InvalidValueData*,
> unsigned long, __ubsan::ReportOptions) ()
> #6 0x000055555bec43be in __ubsan_handle_load_invalid_value_abort ()
> #7 0x000055555c5acb9b in arrow::bit_util::BitReader::GetAligned<bool>
> (this=0x607000001060, num_bytes=1, v=0x7fffffff99d0)
> at /home/antoine/arrow/dev/cpp/src/arrow/util/bit_stream_utils.h:415
> #8 0x000055555c5aa7d4 in arrow::util::RleDecoder::NextCounts<bool>
> (this=0x607000001060) at
> /home/antoine/arrow/dev/cpp/src/arrow/util/rle_encoding.h:663
> #9 0x000055555c5a7328 in arrow::util::RleDecoder::GetBatch<bool>
> (this=0x607000001060, values=0x7ffff5408000, batch_size=2089)
> at /home/antoine/arrow/dev/cpp/src/arrow/util/rle_encoding.h:329
> #10 0x000055555c59834e in parquet::(anonymous
> namespace)::RleBooleanDecoder::Decode (this=0x606000003ce0,
> buffer=0x7ffff5408000, max_values=2089)
> at /home/antoine/arrow/dev/cpp/src/parquet/encoding.cc:2388
> #11 0x000055555c4f43d9 in parquet::internal::(anonymous
> namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)0>
> >::ReadValuesDense (
> this=0x614000001050, values_to_read=2089) at
> /home/antoine/arrow/dev/cpp/src/parquet/column_reader.cc:1531
> #12 0x000055555c4f7668 in parquet::internal::(anonymous
> namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)0>
> >::ReadRecordData (
> this=0x614000001050, num_records=2089) at
> /home/antoine/arrow/dev/cpp/src/parquet/column_reader.cc:1575
> #13 0x000055555c4f03e5 in parquet::internal::(anonymous
> namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)0>
> >::ReadRecords (
> this=0x614000001050, num_records=2089) at
> /home/antoine/arrow/dev/cpp/src/parquet/column_reader.cc:1331
> #14 0x000055555bf0acee in parquet::arrow::(anonymous
> namespace)::LeafReader::LoadBatch (this=0x608000001020, records_to_read=2089)
> at /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:479
> #15 0x000055555bf019df in parquet::arrow::ColumnReaderImpl::NextBatch
> (this=0x608000001020, batch_size=2089, out=0x7fffffffb740)
> at /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:109
> #16 0x000055555bf78829 in parquet::arrow::(anonymous
> namespace)::FileReaderImpl::ReadColumn (this=0x613000001a80, i=0,
> row_groups=std::vector of length 1, capacity 1 = {...},
> reader=0x608000001020, out=0x7fffffffb740)
> at /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:285
> #17 0x000055555bff1b9c in parquet::arrow::(anonymous
> namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous
> namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&,
> std::vector<int, std::allocator<int> > const&,
> arrow::internal::Executor*)::$_4::operator()(unsigned long,
> std::shared_ptr<parquet::arrow::ColumnReaderImpl>) const
> (this=0x7fffffffbdc0, i=0, reader=warning: RTTI symbol not found for class
> 'std::_Sp_counted_deleter<parquet::arrow::ColumnReaderImpl*,
> std::default_delete<parquet::arrow::ColumnReaderImpl>, std::allocator<void>,
> (__gnu_cxx::_Lock_policy)2>'
> warning: RTTI symbol not found for class
> 'std::_Sp_counted_deleter<parquet::arrow::ColumnReaderImpl*,
> std::default_delete<parquet::arrow::ColumnReaderImpl>, std::allocator<void>,
> (__gnu_cxx::_Lock_policy)2>'
> std::shared_ptr<parquet::arrow::ColumnReaderImpl> (use count 2, weak count 0)
> = {...}) at /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:1236
> #18 0x000055555bfed49d in
> arrow::internal::OptionalParallelForAsync<parquet::arrow::(anonymous
> namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous
> namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&,
> std::vector<int, std::allocator<int> > const&,
> arrow::internal::Executor*)::$_4&,
> std::shared_ptr<parquet::arrow::ColumnReaderImpl>,
> std::shared_ptr<arrow::ChunkedArray> >(bool,
> std::vector<std::shared_ptr<parquet::arrow::ColumnReaderImpl>,
> std::allocator<std::shared_ptr<parquet::arrow::ColumnReaderImpl> > >,
> parquet::arrow::(anonymous
> namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous
> namespace)::FileReaderImpl>, std::vector<int, std::allocator<int> > const&,
> std::vector<int, std::allocator<int> > const&,
> arrow::internal::Executor*)::$_4&, arrow::internal::Executor*)
> (use_threads=false, inputs=std::vector of length 1, capacity 1 = {...},
> func=..., executor=0x604000002b90)
> at /home/antoine/arrow/dev/cpp/src/arrow/util/parallel.h:95
> #19 0x000055555bfebe4c in parquet::arrow::(anonymous
> namespace)::FileReaderImpl::DecodeRowGroups (this=0x613000001a80,
> self=std::shared_ptr<parquet::arrow::(anonymous
> namespace)::FileReaderImpl> (empty) = {...}, row_groups=std::vector of length
> 1, capacity 1 = {...},
> column_indices=std::vector of length 1, capacity 1 = {...},
> cpu_executor=0x604000002b90) at
> /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:1254
> #20 0x000055555bee0d57 in parquet::arrow::(anonymous
> namespace)::FileReaderImpl::ReadRowGroups (this=0x613000001a80,
> row_groups=std::vector of length 1, capacity 1 = {...},
> column_indices=std::vector of length 1, capacity 1 = {...},
> out=0x7fffffffc880)
> at /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:1215
> #21 0x000055555bedf948 in parquet::arrow::(anonymous
> namespace)::FileReaderImpl::ReadRowGroup (this=0x613000001a80,
> row_group_index=0,
> column_indices=std::vector of length 1, capacity 1 = {...},
> out=0x7fffffffc880) at
> /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:322
> #22 0x000055555bedfe9c in parquet::arrow::(anonymous
> namespace)::FileReaderImpl::ReadRowGroup (this=0x613000001a80, i=0,
> table=0x7fffffffc880)
> at /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:326
> #23 0x000055555becf902 in parquet::arrow::internal::FuzzReader
> (reader=std::unique_ptr<parquet::arrow::FileReader> = {...})
> at /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:1338
> #24 0x000055555bed0f66 in parquet::arrow::internal::FuzzReader
> (data=0x60e000000e40 " \377 \025", size=159)
> at /home/antoine/arrow/dev/cpp/src/parquet/arrow/reader.cc:1355
> #25 0x000055555bec8b78 in LLVMFuzzerTestOneInput (data=0x60e000000e40 " \377
> \025", size=159) at /home/antoine/arrow/dev/cpp/src/parquet/arrow/fuzz.cc:22
> #26 0x000055555bdef964 in fuzzer::Fuzzer::ExecuteCallback(unsigned char
> const*, unsigned long) ()
> #27 0x000055555bdd9d30 in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*,
> unsigned long) ()
> #28 0x000055555bddfa37 in fuzzer::FuzzerDriver(int*, char***, int
> (*)(unsigned char const*, unsigned long)) ()
> #29 0x000055555be09053 in main ()
> {code}