Hello,
    We all know there is inherent overhead in Python, and we wanted to
compare the performance of reading data with C++ Arrow against PyArrow in
high-throughput systems. Since I couldn't find any benchmarks online for
this comparison, I decided to create my own. Both programs read a Parquet
file into an arrow::Table, one in C++ and one in Python, and both are
single threaded.
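
The C++ side boils down to something like the sketch below (a simplified
sketch, not the exact gist code; the function name, file path, and error
handling are placeholders). The PyArrow side is essentially
pyarrow.parquet.read_table(path, use_threads=False).

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

#include <memory>
#include <string>

// Read a Parquet file into an arrow::Table on a single thread.
arrow::Status ReadParquet(const std::string& path,
                          std::shared_ptr<arrow::Table>* table) {
  // Open the file for random-access reads.
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      infile, arrow::default_memory_pool(), &reader));
  reader->set_use_threads(false);  // keep the comparison single threaded
  return reader->ReadTable(table);
}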

C++ Arrow benchmark -
https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
PyArrow benchmark -
https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530

P.S.: I am new to Arrow, so some things might be inefficient in both
benchmarks.

They read a zstd-compressed Parquet file of around 300 MB.
The results were very different from what we expected.
*PyArrow*
Total time: 5.347517251968384 seconds

*C++ Arrow*
Total time: 5.86806 seconds

For a smaller, gzip-compressed file (0.5 MB), however, C++ Arrow was faster:

*PyArrow*
Total time: 0.013672113418579102 seconds

*C++ Arrow*
Total time: 0.00501744 seconds
(C++ Arrow roughly 2.7x faster)

So my question to the Arrow experts: is this expected in the Arrow world,
or is there some error in my benchmark?

Thank you!


-- 
Warm Regards,

Jay Narale
