Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

Weston Pace Thu, 13 Jun 2024 05:56:01 -0700

pyarrow uses C++ code internally.  With the large files, I would guess that
less than 0.1% of your pyarrow benchmark is spent in the Python interpreter.

Given this fact, my main advice is to not worry too much about the
difference between pyarrow and carrow.  A lot of work goes into pyarrow to
make sure it not only uses carrow efficiently but also picks the best
carrow APIs and default configurations.

I suspect the reason for the difference is that pyarrow now uses the
datasets API internally, even for single-file reads (I'm pretty sure; this
lets us keep behavior consistent).  That path is also asynchronous
internally.  Even with OMP_NUM_THREADS=1, there might still be some parallel
I/O going on (it depends on how many row groups are in your file, etc.).
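If you want to rule threading in or out on the pyarrow side, here is a
minimal sketch, assuming pyarrow is installed.  The function name and the
path you pass in are made up for illustration.  pa.set_cpu_count,
pa.set_io_thread_count and the use_threads argument are pyarrow settings,
but I have not checked that they remove every bit of background I/O on this
code path, so treat the result as a data point rather than proof.

```python
import time


def read_parquet_pinned(path):
    """Read a Parquet file with pyarrow's thread pools shrunk to one thread.

    Hypothetical helper for benchmarking. Note that it changes
    process-wide pyarrow settings as a side effect.
    """
    # Imported here so the module still loads where pyarrow is absent.
    import pyarrow as pa
    import pyarrow.parquet as pq

    pa.set_cpu_count(1)        # CPU pool (decoding, decompression)
    pa.set_io_thread_count(1)  # separate I/O pool

    start = time.perf_counter()
    table = pq.read_table(path, use_threads=False)
    elapsed = time.perf_counter() - start
    return table, elapsed
```

Comparing the elapsed time from this against a plain pq.read_table(path) on
the same file would show how much of the gap, if any, comes from threading.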

When the file is small enough, the interpreter overhead is probably large
enough to outweigh any benefit from the better configuration that pyarrow
uses.
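For reads that take only a few milliseconds, a single timing is hard to
interpret.  A rough sketch of a repeat-and-summarize harness, using only the
standard library (the function name is made up, and the choice of min and
median is mine, not something your gists do):

```python
import statistics
import time


def time_repeated(fn, repeats=20):
    """Call fn() `repeats` times; return (min, median) wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return min(samples), statistics.median(samples)
```

You would pass it something like lambda: pq.read_table("small.parquet")
(hypothetical file name).  If the first call is much slower than the rest,
that points at one-time setup cost rather than steady-state read speed.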

On Wed, Jun 12, 2024 at 10:32 PM J N <jaynarale3...@gmail.com> wrote:

> Hello,
>     We all know that there is inherent overhead in Python, and we wanted to
> compare the performance of reading data using C++ Arrow against PyArrow for
> high throughput systems. Since I couldn't find any benchmarks online for
> this comparison, I decided to create my own. These programs read a Parquet
> file into arrow::Table in both C++ and Python, and are single threaded.
>
> Carrow benchmark -
> https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
> Pyarrow benchmark -
> https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530
>
> P.S. I am new to Arrow, so some things might be inefficient in both.
>
> They read a zstd compressed parquet file of around 300MB.
> The results were very different than what we expected.
> *Pyarrow*
> Total time: 5.347517251968384 seconds
>
> *C++ Arrow*
> Total time: 5.86806 seconds
>
> For smaller files (0.5 MB), however, C++ Arrow was faster.
>
> *Pyarrow*
> gzip
> Total time: 0.013672113418579102 seconds
>
> *C++ Arrow*
> Total time: 0.00501744 seconds
> (carrow about 2.7x faster, going by the two times above)
>
> So I have a question for the Arrow experts: is this expected in the Arrow
> world, or is there some error in my benchmark?
>
> Thank you!
>
>
> --
> Warm Regards,
>
> Jay Narale
>
