corwinjoy commented on issue #39676:
URL: https://github.com/apache/arrow/issues/39676#issuecomment-1901411117
@emkornfield @mapleFU
Those are a lot of questions. To start, I ran a more detailed performance
profile using the previously posted PR plus a larger data set for clarity.
Details are below:
```
Performance profile:
Large table, all integer columns, written with the following properties:

    parquet::WriterProperties::Builder builder;
    builder.enable_write_page_index()->max_row_group_length(chunkSize)->disable_dictionary();

Built using debug settings for readability when profiling, so this may
slightly distort absolute times.

Benchmark is via perf:
> /usr/bin/perf record --freq=1000 --call-graph dwarf -q -o bm_reader_perf \
      /src/arrow/cpp/cmake-build-debug-arrow_debug/debug/parquet-internals-test \
      --gtest_filter=PageIndexBuilderTest.BenchmarkReader:PageIndexBuilderTest/*.BenchmarkReader:PageIndexBuilderTest.BenchmarkReader/*:*/PageIndexBuilderTest.BenchmarkReader/*:*/PageIndexBuilderTest/*.BenchmarkReader \
      --gtest_color=no
> perf report -i bm_reader_perf

Benchmark Results:
(nColumn=6000, nRow=10000), chunk_size=10, pages=1000
time with index (BenchmarkIndexedRead)         =  4.85429s
time with full metadata (BenchmarkRegularRead) = 14.172s
At a high level, we have two benchmarking routines:

BenchmarkRegularRead - Opens the file as usual and reads the full metadata;
subsets the row groups using the subset method; then reads the subset using
FileReaderBuilder with the per-rowgroup metadata.

BenchmarkIndexedRead - Opens the file reading only row group 0's metadata;
reads the OffsetIndex and jumps to the target row groups via the index; then
reads the subset using FileReaderBuilder with the per-rowgroup metadata.
Benchmarks from perf:

-   63.83%  0.00%  parquet-interna  parquet-internals-test  [.] parquet::BenchmarkReadColumnsUsingOffsetIndex
   - parquet::BenchmarkReadColumnsUsingOffsetIndex
      + 48.77% parquet::BenchmarkRegularRead
      + 15.07% parquet::BenchmarkIndexedRead

- 48.77% parquet::BenchmarkRegularRead
   + 43.72% parquet::ReadMetaData
   + 3.44% std::shared_ptr<parquet::FileMetaData>::~shared_ptr
   + 1.59% parquet::ReadIndexedRow

- 48.77% parquet::BenchmarkRegularRead
   - 43.72% parquet::ReadMetaData
        parquet::ParquetFileReader::Open
        parquet::ParquetFileReader::Contents::Open
      - parquet::SerializedFile::ParseMetaData
         - 42.85% parquet::SerializedFile::ParseUnencryptedFileMetadata
              parquet::FileMetaData::Make
              parquet::FileMetaData::FileMetaData
            - parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl
               - 42.81% parquet::ThriftDeserializer::DeserializeMessage<parquet::format::FileMetaData>
                    parquet::ThriftDeserializer::DeserializeUnencryptedMessage<parquet::format::FileMetaData>
                  - parquet::format::FileMetaData::read
                     - 42.60% parquet::format::RowGroup::read
                        + 38.23% parquet::format::ColumnChunk::read
                        + 3.10% std::vector<parquet::format::ColumnChunk, std::allocator<parquet::format::ColumnChunk> >::resize
         + 0.74% arrow::io::internal::RandomAccessFileConcurrencyWrapper<arrow::io::ReadableFile>::ReadAt
   + 3.44% std::shared_ptr<parquet::FileMetaData>::~shared_ptr
   + 1.59% parquet::ReadIndexedRow
So, essentially, the regular read spends a huge amount of time reading all of
the row group metadata. In this example it takes about 40x as long to read the
metadata as to read the actual row data. In contrast, the indexed read reads
only the first row group's metadata and then spends a bit of extra time
reading the OffsetIndex addresses. (I have optimized the OffsetIndex reader.)
- 15.07% parquet::BenchmarkIndexedRead
   - 11.67% parquet::ReadPageIndexesDirect
      + 11.65% parquet::(anonymous namespace)::PageIndexReaderImpl::GetAllOffsets   # This is the offset index reader
   + 1.57% parquet::ReadIndexedRow
   + 1.12% parquet::ReadMetaData
   + 0.69% std::vector<std::vector<std::shared_ptr<parquet::OffsetIndex>, std::allocator<std::shared_ptr<parquet::OffsetIndex> > >, st...

+   48.77%  0.00%  parquet-interna  parquet-internals-test  [.] parquet::BenchmarkRegularRead
Breaking down the slow metadata read further, it does seem that statistics
play a big role:
- parquet::format::FileMetaData::read
   - 42.60% parquet::format::RowGroup::read
      - 38.23% parquet::format::ColumnChunk::read
         - 30.53% parquet::format::ColumnMetaData::read
            + 6.04% parquet::format::Statistics::read
            + 4.10% parquet::format::PageEncodingStats::read
            + 3.30% apache::thrift::protocol::TProtocol::readFieldBegin
            + 2.70% std::vector<parquet::format::Encoding::type, std::allocator<parquet::format::Encoding::type> >::...
            + 2.40% std::vector<parquet::format::PageEncodingStats, std::allocator<parquet::format::PageEncodingStat...
            + 1.98% std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std...
            + 1.55% apache::thrift::protocol::TProtocol::readI32
            + 1.51% apache::thrift::protocol::TProtocol::readListBegin
            + 1.48% apache::thrift::protocol::TProtocol::readI64
            + 0.59% apache::thrift::protocol::TProtocol::readString
         + 2.22% apache::thrift::protocol::TProtocol::readFieldBegin
         + 1.63% apache::thrift::protocol::TProtocol::readI64
         + 0.88% apache::thrift::protocol::TProtocol::readI32
      + 3.10% std::vector<parquet::format::ColumnChunk, std::allocator<parquet::format::ColumnChunk> >::resize
Next, rerun the above test with builder.disable_statistics() to remove the
statistics from the metadata.
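```
For reference, the writer properties for this second run look like the following fragment (the same builder as above with statistics switched off; disable_statistics() is an existing WriterProperties::Builder method, chunkSize as before):

```cpp
parquet::WriterProperties::Builder builder;
builder.enable_write_page_index()
    ->max_row_group_length(chunkSize)
    ->disable_dictionary()
    ->disable_statistics();  // drop per-chunk Statistics from the footer
```

```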
Benchmark Results:
(nColumn=6000, nRow=10000), chunk_size=10, pages=1000
time with index = 4.57732s, time with full metadata = 11.0561s
This is a lot better. Still, reading the metadata takes a long time:
- 62.17%  0.00%  parquet-interna  parquet-internals-test  [.] parquet::BenchmarkReadColumnsUsingOffsetIndex
   - parquet::BenchmarkReadColumnsUsingOffsetIndex
      - 44.23% parquet::BenchmarkRegularRead
         + 38.88% parquet::ReadMetaData
         + 3.52% std::shared_ptr<parquet::FileMetaData>::~shared_ptr
         + 1.76% parquet::ReadIndexedRow
      + 17.94% parquet::BenchmarkIndexedRead
The statistics cost is gone, but the remaining metadata decode is still a significant chunk:
- 44.23% parquet::BenchmarkRegularRead
   - 38.88% parquet::ReadMetaData
        parquet::ParquetFileReader::Open
        parquet::ParquetFileReader::Contents::Open
      - parquet::SerializedFile::ParseMetaData
         - 38.33% parquet::SerializedFile::ParseUnencryptedFileMetadata
              parquet::FileMetaData::Make
              parquet::FileMetaData::FileMetaData
            - parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl
               - 38.31% parquet::ThriftDeserializer::DeserializeMessage<parquet::format::FileMetaData>
                    parquet::ThriftDeserializer::DeserializeUnencryptedMessage<parquet::format::FileMetaData>
                  - parquet::format::FileMetaData::read
                     - 38.20% parquet::format::RowGroup::read
                        - 33.54% parquet::format::ColumnChunk::read
                           - 26.74% parquet::format::ColumnMetaData::read
                              + 4.87% parquet::format::PageEncodingStats::read
                              + 3.44% apache::thrift::protocol::TProtocol::readFieldBegin
                              + 2.80% std::vector<parquet::format::Encoding::type, std::allocator<parquet::format::Encoding::type> >::...
                              + 2.57% std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std...
                              + 2.57% std::vector<parquet::format::PageEncodingStats, std::allocator<parquet::format::PageEncodingStat...
                              + 1.88% apache::thrift::protocol::TProtocol::readI32
                              + 1.77% apache::thrift::protocol::TProtocol::readListBegin
                              + 1.61% apache::thrift::protocol::TProtocol::readI64
                              + 0.78% apache::thrift::protocol::TProtocol::readString
                           + 1.84% apache::thrift::protocol::TProtocol::readFieldBegin
                           + 1.19% apache::thrift::protocol::TProtocol::readI64
                        + 3.83% std::vector<parquet::format::ColumnChunk, std::allocator<parquet::format::ColumnChunk> >::resize
```
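To make the indexed path concrete, here is a pseudocode-level C++ sketch. GetPageIndexReader() / GetOffsetIndex() are the existing page-index API in Arrow C++; an open mode that skips the per-row-group footer metadata is what the PR adds, so treat that part as hypothetical:

```cpp
// Sketch only - not the benchmark code from the PR.
std::unique_ptr<parquet::ParquetFileReader> reader =
    parquet::ParquetFileReader::OpenFile(path);  // PR: would read only row group 0's metadata

// One OffsetIndex per column records the byte range of every page in a
// row group, so target row groups can be located without their ColumnMetaData.
auto page_index_reader = reader->GetPageIndexReader();
auto offset_index =
    page_index_reader->RowGroup(target_row_group)->GetOffsetIndex(column);
for (const auto& loc : offset_index->page_locations()) {
  // loc.offset and loc.compressed_page_size give the page bytes to read.
}
```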