Hello,

We all know there is inherent overhead in Python, so we wanted to compare the performance of reading data with C++ Arrow against PyArrow for high-throughput systems. Since I couldn't find any benchmarks online for this comparison, I decided to create my own. Both programs read a Parquet file into an arrow::Table, and both are single-threaded.
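For reference, the PyArrow side looks roughly like the following (a minimal sketch, assuming a local file named data.parquet; the actual gist may differ). One detail worth noting for anyone reproducing this: pq.read_table is multi-threaded by default, so use_threads=False is needed for a genuinely single-threaded comparison.

    import time
    import pyarrow.parquet as pq

    start = time.time()
    # use_threads=False keeps the read single-threaded;
    # pq.read_table uses multiple threads by default.
    table = pq.read_table("data.parquet", use_threads=False)
    elapsed = time.time() - start

    print(f"Read {table.num_rows} rows")
    print(f"Total time: {elapsed} seconds")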
C++ Arrow benchmark - https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
PyArrow benchmark - https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530

P.S.: I am new to Arrow, so some things might be inefficient in both.

They read a zstd-compressed Parquet file of around 300 MB. The results were very different from what we expected.

*PyArrow*
Total time: 5.347517251968384 seconds

*C++ Arrow*
Total time: 5.86806 seconds

For smaller files (0.5 MB), however, C++ Arrow was better:

*PyArrow* (gzip)
Total time: 0.013672113418579102 seconds

*C++ Arrow*
Total time: 0.00501744 seconds (C++ Arrow ~10x better)

So my question to the Arrow experts: is this expected in the Arrow world, or is there some error in my benchmark?

Thank you!

--
Warm Regards,
Jay Narale