Re: [PARQUET_CPP] Why does the arrow reader in parquet does an extra copy?

Keith Chapman Thu, 12 Jan 2017 17:21:06 -0800

Cool, thanks for the update, Wes. I was wondering if there was some design
issue I was not aware of :). I will keep my eyes on the PR and look to make
more optimizations and upstream them.


Regards,
Keith.

http://keith-chapman.com

On Thu, Jan 12, 2017 at 5:15 PM, Wes McKinney <[email protected]> wrote:

> hi Keith
>
> Uwe is working on this right now (avoiding the extra copy):
>
> https://github.com/apache/parquet-cpp/pull/218
>
> We would appreciate any efforts to further optimize these code paths.
>
> Thanks
> Wes
>
> On Thu, Jan 12, 2017 at 7:21 PM, Keith Chapman <[email protected]> wrote:
> > Hi,
> >
> > I'm using the parquet-cpp library to read in some parquet files. I saw
> > that the parquet-cpp library has support for arrow and hence I thought of
> > giving it a shot. When running experiments I did not see any significant
> > increase in performance, hence I was taking a look at the code. It looks to
> > me like the arrow reader uses an intermediate buffer to store the data and
> > hence does an extra copy. Is this because of the mismatch in data types
> > between parquet and arrow? I'm specifically referring to the
> > FlatColumnReader::Impl::ReadNullableFlatBatch method in [1] (line 276).
> > Also, I would imagine that setting one bit at a time would be inefficient;
> > I'm not too sure if the compiler would be smart enough to set a word at a
> > time (I doubt it though). Just wondering if there was a reason behind having
> > the code as it is.
> >
> > [1]
> > https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc
> >
> >
> > Regards,
> > Keith.
> >
> > http://keith-chapman.com
>
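
To make the bit-at-a-time versus word-at-a-time question in the quoted message concrete, here is a minimal sketch of both approaches for building a validity bitmap from definition levels. It is not the code in reader.cc: the function names BuildValidityBitwise and BuildValidityWordwise are hypothetical, and it assumes a flat nullable column where a value is present when its definition level equals the maximum level, with bits packed least-significant-bit first as in Arrow validity bitmaps. Whether the word-wise version is actually faster would have to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// NOTE: hypothetical helpers for illustration only; these are not
// functions from parquet-cpp or Arrow.

// Bit-at-a-time: one read-modify-write of an output byte per non-null value.
std::vector<uint8_t> BuildValidityBitwise(const std::vector<int16_t>& def_levels,
                                          int16_t max_def_level) {
  std::vector<uint8_t> bitmap((def_levels.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bitmap;
}

// Word-at-a-time: accumulate 64 validity bits in a local variable,
// then write them out as 8 bytes.
std::vector<uint8_t> BuildValidityWordwise(const std::vector<int16_t>& def_levels,
                                           int16_t max_def_level) {
  const std::size_t n = def_levels.size();
  std::vector<uint8_t> bitmap((n + 7) / 8, 0);
  std::size_t i = 0;
  std::size_t out = 0;  // next output byte; always equals i / 8 in this loop

  // Process full chunks of 64 values.
  for (; i + 64 <= n; i += 64) {
    uint64_t word = 0;
    for (std::size_t j = 0; j < 64; ++j) {
      word |= static_cast<uint64_t>(def_levels[i + j] == max_def_level) << j;
    }
    // Emit bytes from least to most significant so the result does not
    // depend on host endianness.
    for (int b = 0; b < 8; ++b) {
      bitmap[out++] = static_cast<uint8_t>(word >> (8 * b));
    }
  }

  // Handle the remaining (fewer than 64) values one bit at a time.
  for (; i < n; ++i) {
    if (def_levels[i] == max_def_level) {
      bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
  return bitmap;
}
```

Both functions produce the same bitmap for the same input, so the word-wise one can be checked against the simpler one before any benchmarking.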
