WillAyd opened a new issue, #41224:
URL: https://github.com/apache/arrow/issues/41224
### Describe the enhancement requested
Reading binary data types through the Parquet reader performs very
differently from reading, say, integral types. Of course the two aren't expected
to be identical, but I was surprised to see so many Append calls in a
performance trace of the Parquet reader with strings.
To illustrate, I have created the following data:
```python
import pyarrow as pa
import pyarrow.parquet as pq
tbl1 = pa.Table.from_pydict({"col": range(10_000_000)})
pq.write_table(tbl1, "ints.parquet")
tbl2 = pa.Table.from_pydict({"col": ["foo", "bar"] * 5_000_000})
pq.write_table(tbl2, "strings.parquet")
```
I then wrote two simple benchmarks against these files. read_ints.py:
```python
import pyarrow.parquet as pq
for _ in range(10):
    pq.read_table("ints.parquet")
```
and read_strings.py:
```python
import pyarrow.parquet as pq
for _ in range(10):
    pq.read_table("strings.parquet")
```
When executing these under callgrind, here is what I see for the integer
benchmark:
```
10,978,640,541 (55.79%) ???:std::pair<unsigned char const*, long>
snappy::DecompressBranchless<char*>(unsigned char const*, unsigned char const*,
long, char*, long)
[/home/willayd/mambaforge/envs/scratchpad/lib/libsnappy.so.1.2.0]
2,594,207,330 (13.18%) ???:snappy::MemCopy64(char*, void const*, unsigned
long) [/home/willayd/mambaforge/envs/scratchpad/lib/libsnappy.so.1.2.0]
2,395,482,666 (12.17%)
./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
[/usr/lib/x86_64-linux-gnu/libc.so.6]
810,937,500 ( 4.12%) ???:parquet::internal::GreaterThanBitmapAvx2(short
const*, long, short)
[/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
598,618,440 ( 3.04%) ???:snappy::DeferMemCopy(void const**, unsigned
long*, void const*, unsigned long)
[/home/willayd/mambaforge/envs/scratchpad/lib/libsnappy.so.1.2.0]
198,977,226 ( 1.01%)
./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:memcpy@GLIBC_2.2.5
[/usr/lib/x86_64-linux-gnu/libc.so.6]
177,894,713 ( 0.90%)
/usr/local/src/conda/python-3.12.1/Modules/_sre/sre_lib.h:sre_ucs1_match
[/home/willayd/mambaforge/envs/scratchpad/bin/python3.12]
133,685,200 ( 0.68%) ???:int
arrow::util::RleDecoder::GetBatchWithDict<long>(long const*, int, long*, int)
[/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
117,187,500 ( 0.60%) ???:long
parquet::internal::standard::DefLevelsBatchToBitmap<false>(short const*, long,
long, parquet::internal::LevelInfo, arrow::internal::FirstTimeBitmapWriter*)
[clone .isra.0]
[/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
77,081,177 ( 0.39%) ./elf/./elf/dl-lookup.c:do_lookup_x
[/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]
```
versus the string benchmark:
```
7,100,000,000 (33.57%)
???:arrow::BaseBinaryBuilder<arrow::BinaryType>::Append(unsigned char const*,
int) [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
3,300,018,590 (15.60%) ???:arrow::BufferBuilder::Append(void const*, long)
[/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
3,300,004,000 (15.60%) ???:arrow::ArrayBuilder::Reserve(long)
[/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
2,601,226,650 (12.30%) ???:parquet::(anonymous
namespace)::DictByteArrayDecoderImpl::DecodeArrowDenseNonNull(int,
parquet::EncodingTraits<parquet::PhysicalType<(parquet::Type::type)6>
>::Accumulator*, int*) [clone .constprop.0]
[/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
1,671,172,008 ( 7.90%)
./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:memcpy@GLIBC_2.2.5
[/usr/lib/x86_64-linux-gnu/libc.so.6]
810,937,500 ( 3.83%) ???:parquet::internal::GreaterThanBitmapAvx2(short
const*, long, short)
[/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
200,000,204 ( 0.95%) ???:arrow::ArrayBuilder::length() const
[/home/willayd/mambaforge/envs/scratchpad/lib/libarrow.so.1500.2.0]
177,894,713 ( 0.84%)
/usr/local/src/conda/python-3.12.1/Modules/_sre/sre_lib.h:sre_ucs1_match
[/home/willayd/mambaforge/envs/scratchpad/bin/python3.12]
140,038,870 ( 0.66%) ???:int
arrow::bit_util::BitReader::GetBatch<int>(int, int*, int)
[/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
117,187,500 ( 0.55%) ???:long
parquet::internal::standard::DefLevelsBatchToBitmap<false>(short const*, long,
long, parquet::internal::LevelInfo, arrow::internal::FirstTimeBitmapWriter*)
[clone .isra.0]
[/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
```
Is the string reader expected to be so heavy on Append? I am by no means
an expert in the Parquet format, but I believe there is a
`total_uncompressed_size` in the column metadata that might be usable to
pre-allocate the buffer for binary data, so that we don't have to spend as much
time in Append calls.
The instruction counts (Ir) above are from running the benchmarks on pyarrow 15.0.2.
### Component(s)
C++