> I suspect the reason for the difference is that pyarrow uses the datasets
> API internally (I'm pretty sure) even for single file reads now (this
> allows us to have consistent behavior). This is also an asynchronous path
> internally.

I see, thanks for explaining. Does this mean that the C++ benchmark needs to
be configured better? I read the architecture docs, and Python indeed uses
Cython to call into the C++ library. I assumed that the defaults would be the
same for both, since I was using the same (latest) version of Arrow.
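For anyone following along, here is a minimal sketch of how one might pin
pyarrow closer to a truly single-threaded, unbuffered read. The file name is
a placeholder, and this approximates rather than reproduces the gist code:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # OMP_NUM_THREADS does not cap Arrow's own thread pools, so limit
    # both the CPU pool (decoding) and the I/O pool (parallel reads).
    pa.set_cpu_count(1)
    pa.set_io_thread_count(1)

    table = pq.read_table(
        "data.parquet",      # placeholder path
        use_threads=False,   # no parallel column decoding
        pre_buffer=False,    # disable the read-ahead the datasets path enables
    )
    print(table.num_rows)

With the pools capped and pre-buffering disabled, any remaining gap between
the two benchmarks should mostly reflect reader defaults rather than hidden
parallelism.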
On Thu, Jun 13, 2024 at 5:56 AM Weston Pace <weston.p...@gmail.com> wrote:

> pyarrow uses C++ code internally. With the large files I would guess that
> less than 0.1% of your pyarrow benchmark is spent in the Python
> interpreter.
>
> Given this fact, my main advice is to not worry too much about the
> difference between pyarrow and C++ Arrow. A lot of work goes into pyarrow
> to make sure it not only uses the C++ library efficiently but also picks
> the best APIs and default configurations.
>
> I suspect the reason for the difference is that pyarrow uses the datasets
> API internally (I'm pretty sure) even for single file reads now (this
> allows us to have consistent behavior). This is also an asynchronous path
> internally. Even with OMP_NUM_THREADS=1 there still might be some parallel
> I/O going on (it depends on how many row groups are in your file, etc.).
>
> When the file is small enough, the interpreter overhead is probably large
> enough to outweigh any benefits gained from the better configuration that
> pyarrow is doing.
>
> On Wed, Jun 12, 2024 at 10:32 PM J N <jaynarale3...@gmail.com> wrote:
>
> > Hello,
> > We all know that there is inherent overhead in Python, and we wanted to
> > compare the performance of reading data using C++ Arrow against pyarrow
> > for high-throughput systems. Since I couldn't find any benchmarks
> > online for this comparison, I decided to create my own. These programs
> > read a Parquet file into an arrow::Table in both C++ and Python, and
> > are single-threaded.
> >
> > C++ Arrow benchmark -
> > https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
> > Pyarrow benchmark -
> > https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530
> >
> > PS: I am new to Arrow, so some things might be inefficient in both.
> >
> > They read a zstd-compressed Parquet file of around 300 MB.
> > The results were very different from what we expected.
> >
> > *Pyarrow*
> > Total time: 5.347517251968384 seconds
> >
> > *C++ Arrow*
> > Total time: 5.86806 seconds
> >
> > For smaller files (0.5 MB), however, C++ Arrow was better.
> >
> > *Pyarrow*
> > gzip
> > Total time: 0.013672113418579102 seconds
> >
> > *C++ Arrow*
> > Total time: 0.00501744 seconds
> > (C++ Arrow ~10x better)
> >
> > So I have a question for the Arrow experts: is this expected in the
> > Arrow world, or is there some error in my benchmark?
> >
> > Thank you!
> >
> > --
> > Warm Regards,
> >
> > Jay Narale

--
Warm Regards,
Jay Narale
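For reference, a rough sketch of the timing pattern in the linked pyarrow
gist; "large.parquet" stands in for the ~300 MB zstd-compressed file:

    import time
    import pyarrow.parquet as pq

    start = time.time()
    # Decompresses and loads the whole file into an Arrow Table.
    table = pq.read_table("large.parquet")
    elapsed = time.time() - start
    print(f"Total time: {elapsed} seconds")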