I added these configs to the C++ code as you suggested:

    parquet::ArrowReaderProperties arrow_reader_properties =
        parquet::default_arrow_reader_properties();
    arrow_reader_properties.set_pre_buffer(true);
    arrow_reader_properties.set_use_threads(true);
    parquet::ReaderProperties reader_properties =
        parquet::default_reader_properties();
Now the time has improved and Carrow is much closer to PyArrow.

*Pyarrow* https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530
Average time to read table: 4.3928395581245425 seconds

*Carrow* https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
Average total time: 4.27191 seconds

I was evaluating whether C++-based Parquet readers would be more efficient for training our ML models; if these numbers hold, there is no strong use case there. I may add these to a public repo so others can reproduce and improve the benchmark to help them decide on C++-based use cases.

On Thu, Jun 13, 2024 at 10:56 AM J N <jaynarale3...@gmail.com> wrote:

> I will do an audit of the configs. Is there a way to get a dump of all of
> them? Also, are there any particular candidate configs to look for?
>
> On Thu, Jun 13, 2024 at 10:37 AM wish maple <maplewish...@gmail.com>
> wrote:
>
>> Some configs, like use_threads, would be true in Python but false in C++.
>>
>> Maybe we can fill all configs explicitly with the same values.
>>
>> Best,
>> Xuwei Fu
>>
>> J N <jaynarale3...@gmail.com> wrote on Thu, Jun 13, 2024 at 13:32:
>>
>> > Hello,
>> > We all know that there is inherent overhead in Python, and we wanted
>> > to compare the performance of reading data using C++ Arrow against
>> > PyArrow for high-throughput systems. Since I couldn't find any
>> > benchmarks online for this comparison, I decided to create my own.
>> > These programs read a Parquet file into an arrow::Table in both C++
>> > and Python, and are single-threaded.
>> >
>> > Carrow benchmark -
>> > https://gist.github.com/jaystarshot/9608bf4b9fdd399c1658d71328ce2c6d
>> > Pyarrow benchmark -
>> > https://gist.github.com/jaystarshot/451f97b75e9750b1f00d157e6b9b3530
>> >
>> > PS: I am new to Arrow, so some things might be inefficient in both.
>> >
>> > They read a zstd-compressed Parquet file of around 300 MB.
>> > The results were very different from what we expected.
>> > *Pyarrow*
>> > Total time: 5.347517251968384 seconds
>> >
>> > *C++ Arrow*
>> > Total time: 5.86806 seconds
>> >
>> > For smaller files (0.5 MB), however, C++ Arrow was better:
>> >
>> > *Pyarrow*
>> > gzip
>> > Total time: 0.013672113418579102 seconds
>> >
>> > *C++ Arrow*
>> > Total time: 0.00501744 seconds
>> > (Carrow ~2.7x faster)
>> >
>> > So I have a question for the Arrow experts: is this expected in the
>> > Arrow world, or is there some error in my benchmark?
>> >
>> > Thank you!
>> >
>> > --
>> > Warm Regards,
>> >
>> > Jay Narale
>> >
>
> --
> Warm Regards,
>
> Jay Narale

--
Warm Regards,

Jay Narale