Re: [C++][Python] [Parquet] Parquet Reader C++ vs python benchmark

J N Thu, 13 Jun 2024 10:56:35 -0700

I will audit the configs. Is there a way to dump all of them?
Also, are there any particular configs I should look for?
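
One way to get a dump on the Python side, as a minimal sketch: list the keyword defaults of the reader function with the standard-library inspect module. The helper name dump_defaults and the stand-in function fake_read_table are hypothetical and exist only so the example runs without PyArrow installed; the stand-in's parameters are illustrative, not a copy of any real signature.

```python
import inspect


def dump_defaults(func):
    """Return {parameter name: default} for each parameter of func that has a default."""
    params = inspect.signature(func).parameters
    return {
        name: p.default
        for name, p in params.items()
        if p.default is not inspect.Parameter.empty
    }


# Hypothetical stand-in for a reader function (assumption, not a real API).
# Its parameter names and default values are illustrative only.
def fake_read_table(source, columns=None, use_threads=True, buffer_size=0):
    return None


defaults = dump_defaults(fake_read_table)
for name, value in sorted(defaults.items()):
    print(f"{name} = {value!r}")
```

With PyArrow installed, passing pyarrow.parquet.read_table to the same helper should list the Python-side keyword defaults. This covers only Python keyword arguments; the C++ reader properties would still have to be checked separately against the C++ code.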


On Thu, Jun 13, 2024 at 10:37 AM wish maple <[email protected]> wrote:

> Some configs, like use_threads, would be true in Python but false in C++.
>
> Maybe we can fill in all configs explicitly with the same values.
>
> Best,
> Xuwei Fu
>
> J N <[email protected]> wrote on Thu, Jun 13, 2024 at 13:32:
>
> > Hello,
> >     We all know that there is inherent overhead in Python, and we wanted
> > to compare the performance of reading data using C++ Arrow against
> > PyArrow for high-throughput systems. Since I couldn't find any benchmarks
> > online for this comparison, I decided to create my own. These programs
> > read a Parquet file into arrow::Table in both C++ and Python, and are
> > single-threaded.
> >
> > C++ Arrow benchmark -
> > https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
> > PyArrow benchmark -
> > https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530
> >
> > P.S. I am new to Arrow, so some things might be inefficient in both programs.
> >
> > They read a zstd-compressed Parquet file of around 300 MB.
> > The results were very different from what we expected.
> > *Pyarrow*
> > Total time: 5.347517251968384 seconds
> >
> > *C++ Arrow*
> > Total time: 5.86806 seconds
> >
> > For smaller files (0.5 MB), however, C++ Arrow was faster:
> >
> > *Pyarrow*
> > gzip
> > Total time: 0.013672113418579102 seconds
> >
> > *C++ Arrow*
> > Total time: 0.00501744 seconds
> > (C++ Arrow about 2.7x faster)
> >
> > So I have a question for the Arrow experts: is this expected in the Arrow
> > world, or is there some error in my benchmark?
> >
> > Thank you!
> >
> >
> > --
> > Warm Regards,
> >
> > Jay Narale
> >
>
-- 
Warm Regards,

Jay Narale
