[PARQUET_CPP] Why does the arrow reader in parquet do an extra copy?

Keith Chapman Thu, 12 Jan 2017 16:22:02 -0800

Hi,

I'm using the parquet-cpp library to read in some parquet files. I saw
that the parquet-cpp library has support for arrow, so I thought of
giving it a shot. When running experiments I did not see any significant
increase in performance, so I took a look at the code. It looks to
me like the arrow reader uses an intermediate buffer to store the data and
hence does an extra copy. Is this because of the mismatch in data types
between parquet and arrow? I'm specifically referring to the
FlatColumnReader::Impl::ReadNullableFlatBatch method in [1] (line 276).
Also, I would imagine that setting one bit at a time would be inefficient;
I'm not sure the compiler would be smart enough to set a word at a time
(I doubt it though). Just wondering if there was a reason behind having the
code as it is.


[1]
https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc


Regards,
Keith.

http://keith-chapman.com
