Re: [PARQUET_CPP] Why does the arrow reader in parquet does an extra copy?

Wes McKinney Thu, 12 Jan 2017 17:17:28 -0800

hi Keith

Uwe is working on this right now (avoiding the extra copy):


https://github.com/apache/parquet-cpp/pull/218

We would appreciate any efforts to further optimize these code paths.

Thanks
Wes

On Thu, Jan 12, 2017 at 7:21 PM, Keith Chapman <[email protected]> wrote:
> Hi,
>
> I'm using the parquet-cpp library to read in some parquet files. I saw
> that the parquet-cpp library has support for arrow and hence I thought of
> giving it a shot. When running experiments I did not see any significant
> increase in performance, hence I was taking a look at the code. It looks to
> me like the arrow reader uses an intermediate buffer to store the data and
> hence does an extra copy. Is this because of the mismatch in data types
> between parquet and arrow? I'm specifically referring to the
> FlatColumnReader::Impl::ReadNullableFlatBatch method in [1] (line 276).
> Also I would imagine that setting one bit at a time would be inefficient,
> not too sure if the compiler would be smart enough to set a word at a time
> (I doubt it though). Just wondering if there was a reason behind having the
> code as it is.
>
> [1]
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc
>
>
> Regards,
> Keith.
>
> http://keith-chapman.com
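
For illustration of the bit-at-a-time concern raised in the quoted message,
here is a minimal sketch. It is not the parquet-cpp code; the function names
SetBitsOneAtATime and SetBitsByteAtATime are hypothetical. It assumes a flat
nullable column where a value is present when its definition level equals the
maximum definition level, and an LSB-first validity bitmap (bit i lives in
byte i / 8 at position i % 8) that starts out zeroed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: one read-modify-write of the bitmap per value.
// Assumes `bitmap` is zero-initialized and holds at least (n + 7) / 8 bytes.
void SetBitsOneAtATime(const int16_t* def_levels, size_t n,
                       int16_t max_def_level, uint8_t* bitmap) {
  for (size_t i = 0; i < n; ++i) {
    if (def_levels[i] == max_def_level) {
      bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
}

// Hypothetical helper: accumulate eight validity bits in a local variable
// and store each bitmap byte exactly once.
void SetBitsByteAtATime(const int16_t* def_levels, size_t n,
                        int16_t max_def_level, uint8_t* bitmap) {
  size_t i = 0;
  // Full groups of eight values.
  for (; i + 8 <= n; i += 8) {
    uint8_t byte = 0;
    for (size_t b = 0; b < 8; ++b) {
      const unsigned valid = (def_levels[i + b] == max_def_level) ? 1u : 0u;
      byte |= static_cast<uint8_t>(valid << b);
    }
    bitmap[i / 8] = byte;
  }
  // Trailing group of fewer than eight values, if any.
  if (i < n) {
    uint8_t byte = 0;
    for (size_t b = 0; i + b < n; ++b) {
      const unsigned valid = (def_levels[i + b] == max_def_level) ? 1u : 0u;
      byte |= static_cast<uint8_t>(valid << b);
    }
    bitmap[i / 8] = byte;
  }
}
```

Both functions produce the same bitmap on zeroed input. The second only shows
the general shape of a batched write; whether it is actually faster depends on
the compiler and the data, so it would need to be benchmarked.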
