I added these configs to the C++ code as you suggested:

    parquet::ArrowReaderProperties arrow_reader_properties =
        parquet::default_arrow_reader_properties();
    arrow_reader_properties.set_pre_buffer(true);
    arrow_reader_properties.set_use_threads(true);
    parquet::ReaderProperties reader_properties =
        parquet::default_reader_properties();
Now the time has improved and Carrow is much closer to PyArrow.

*Pyarrow* https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530
Average time to read table: 4.3928395581245425 seconds

*Carrow* https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
Average total time: 4.27191 seconds

I was evaluating whether C++-based Parquet readers would be more efficient for training our ML models; if these numbers hold, there is no strong use case there. I may add these to a public repo so others can reproduce and improve the benchmark to help them decide on C++-based use cases.

On Thu, Jun 13, 2024 at 10:56 AM J N <jaynarale3...@gmail.com> wrote:

> I will do an audit of the configs. Is there a way to get a dump of all of
> them? Also, are there any particular candidate configs to look for?
>
> On Thu, Jun 13, 2024 at 10:37 AM wish maple <maplewish...@gmail.com>
> wrote:
>
>> Some configs, like use_threads, would be true in Python but false in C++.
>>
>> Maybe we can fill all configs explicitly with the same values.
>>
>> Best,
>> Xuwei Fu
>>
>> J N <jaynarale3...@gmail.com> wrote on Thu, Jun 13, 2024 at 13:32:
>>
>> > Hello,
>> > We all know that there is inherent overhead in Python, and we wanted
>> > to compare the performance of reading data using C++ Arrow against
>> > PyArrow for high-throughput systems. Since I couldn't find any
>> > benchmarks online for this comparison, I decided to create my own.
>> > These programs read a Parquet file into an arrow::Table in both C++
>> > and Python, and are single-threaded.
>> >
>> > Carrow benchmark -
>> > https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
>> > Pyarrow benchmark -
>> > https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530
>> >
>> > PS: I am new to Arrow, so some things might be inefficient in both.
>> >
>> > They read a zstd-compressed Parquet file of around 300 MB.
>> > The results were very different from what we expected.
>> > *Pyarrow*
>> > Total time: 5.347517251968384 seconds
>> >
>> > *C++ Arrow*
>> > Total time: 5.86806 seconds
>> >
>> > For smaller files (0.5 MB), however, C++ Arrow was better:
>> >
>> > *Pyarrow*
>> > gzip
>> > Total time: 0.013672113418579102 seconds
>> >
>> > *C++ Arrow*
>> > Total time: 0.00501744 seconds
>> > (Carrow ~2.7x faster)
>> >
>> > So I have a question for the Arrow experts: is this expected in the
>> > Arrow world, or is there some error in my benchmark?
>> >
>> > Thank you!
>> >
>> > --
>> > Warm Regards,
>> >
>> > Jay Narale
>> >
>
> --
> Warm Regards,
>
> Jay Narale

--
Warm Regards,

Jay Narale