> I suspect the reason for the difference is that pyarrow uses the datasets
> API internally (I'm pretty sure) even for single file reads now (this
> allows us to have consistent behavior). This is also an asynchronous path
> internally.

I see, thanks for explaining. Does this mean that the C++ benchmark needs to
be configured better? I read the architecture docs, and Python indeed uses
Cython to call into the C++ library. I assumed that the defaults would be the
same for both, since I was using the same (latest) version of Arrow.
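For anyone following along, here is a minimal sketch of how one might pin
pyarrow closer to a truly single-threaded, unbuffered read. The file name is
a placeholder, and this approximates rather than reproduces the gist code:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # OMP_NUM_THREADS does not cap Arrow's own thread pools, so limit
    # both the CPU pool (decoding) and the I/O pool (parallel reads).
    pa.set_cpu_count(1)
    pa.set_io_thread_count(1)

    table = pq.read_table(
        "data.parquet",      # placeholder path
        use_threads=False,   # no parallel column decoding
        pre_buffer=False,    # disable the read-ahead the datasets path enables
    )
    print(table.num_rows)

With the pools capped and pre-buffering disabled, any remaining gap between
the two benchmarks should mostly reflect reader defaults rather than hidden
parallelism.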
On Thu, Jun 13, 2024 at 5:56 AM Weston Pace <weston.p...@gmail.com> wrote:

> pyarrow uses C++ code internally. With the large files I would guess that
> less than 0.1% of your pyarrow benchmark is spent in the Python
> interpreter.
>
> Given this fact, my main advice is to not worry too much about the
> difference between pyarrow and C++ Arrow. A lot of work goes into pyarrow
> to make sure it not only uses the C++ library efficiently but also picks
> the best APIs and default configurations.
>
> I suspect the reason for the difference is that pyarrow uses the datasets
> API internally (I'm pretty sure) even for single file reads now (this
> allows us to have consistent behavior). This is also an asynchronous path
> internally. Even with OMP_NUM_THREADS=1 there still might be some parallel
> I/O going on (it depends on how many row groups are in your file, etc.).
>
> When the file is small enough, the interpreter overhead is probably large
> enough to outweigh any benefits gained from the better configuration that
> pyarrow is doing.
>
> On Wed, Jun 12, 2024 at 10:32 PM J N <jaynarale3...@gmail.com> wrote:
>
> > Hello,
> > We all know that there is inherent overhead in Python, and we wanted to
> > compare the performance of reading data using C++ Arrow against pyarrow
> > for high-throughput systems. Since I couldn't find any benchmarks
> > online for this comparison, I decided to create my own. These programs
> > read a Parquet file into an arrow::Table in both C++ and Python, and
> > are single-threaded.
> >
> > C++ Arrow benchmark -
> > https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
> > Pyarrow benchmark -
> > https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530
> >
> > PS: I am new to Arrow, so some things might be inefficient in both.
> >
> > They read a zstd-compressed Parquet file of around 300 MB.
> > The results were very different from what we expected.
> >
> > *Pyarrow*
> > Total time: 5.347517251968384 seconds
> >
> > *C++ Arrow*
> > Total time: 5.86806 seconds
> >
> > For smaller files (0.5 MB), however, C++ Arrow was better.
> >
> > *Pyarrow*
> > gzip
> > Total time: 0.013672113418579102 seconds
> >
> > *C++ Arrow*
> > Total time: 0.00501744 seconds
> > (C++ Arrow ~10x better)
> >
> > So I have a question for the Arrow experts: is this expected in the
> > Arrow world, or is there some error in my benchmark?
> >
> > Thank you!
> >
> > --
> > Warm Regards,
> >
> > Jay Narale

--
Warm Regards,
Jay Narale
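For reference, a rough sketch of the timing pattern in the linked pyarrow
gist; "large.parquet" stands in for the ~300 MB zstd-compressed file:

    import time
    import pyarrow.parquet as pq

    start = time.time()
    # Decompresses and loads the whole file into an Arrow Table.
    table = pq.read_table("large.parquet")
    elapsed = time.time() - start
    print(f"Total time: {elapsed} seconds")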