Hi Keith, just a small heads-up: the pull request for the read path is merged. I'm currently looking into removing all those copies in the write path as well.
Cheers
Uwe

On Fri, Jan 13, 2017, at 02:20 AM, Keith Chapman wrote:
> Cool, thanks for the update, Wes. I was wondering if there was some design
> issue I was not aware of :). I will keep my eyes on the PR and look to make
> more optimizations and upstream them.
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Thu, Jan 12, 2017 at 5:15 PM, Wes McKinney <[email protected]> wrote:
>
> > hi Keith
> >
> > Uwe is working on this right now (avoiding the extra copy):
> >
> > https://github.com/apache/parquet-cpp/pull/218
> >
> > We would appreciate any efforts to further optimize these code paths.
> >
> > Thanks
> > Wes
> >
> > On Thu, Jan 12, 2017 at 7:21 PM, Keith Chapman <[email protected]> wrote:
> > > Hi,
> > >
> > > I'm using the parquet-cpp library to read in some parquet files. I saw
> > > that the parquet-cpp library has support for arrow and hence I thought of
> > > giving it a shot. When running experiments I did not see any significant
> > > increase in performance, hence I was taking a look at the code. It looks to
> > > me like the arrow reader uses an intermediate buffer to store the data and
> > > hence does an extra copy. Is this because of the mismatch in data types
> > > between parquet and arrow? I'm specifically referring to the
> > > FlatColumnReader::Impl::ReadNullableFlatBatch method in [1] (line 276).
> > > Also, I would imagine that setting one bit at a time would be inefficient;
> > > I'm not too sure if the compiler would be smart enough to set a word at a
> > > time (I doubt it though). Just wondering if there was a reason behind having
> > > the code as it is.
> > >
> > > [1]
> > > https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/reader.cc
> > >
> > >
> > > Regards,
> > > Keith.
> > >
> > > http://keith-chapman.com
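
To make the bit-setting concern in the quoted message concrete, here is a minimal sketch of the two approaches. It is not the parquet-cpp code: both function names are hypothetical, the input is assumed to be one byte per value (non-zero meaning "not null"), and the output is assumed to be an LSB-first validity bitmap. The first variant does a read-modify-write on the bitmap for every value; the second builds each output byte in a local variable and stores it once per eight values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not parquet-cpp API): set one validity bit per value.
// valid[i] != 0 means value i is non-null; bits are LSB-first within a byte.
inline void PackBitsOneAtATime(const std::vector<uint8_t>& valid,
                               std::vector<uint8_t>* bitmap) {
  assert(bitmap != nullptr);
  bitmap->assign((valid.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < valid.size(); ++i) {
    if (valid[i]) {
      // One read-modify-write on the output buffer per value.
      (*bitmap)[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
}

// Hypothetical helper: accumulate eight flags locally, then store one byte.
inline void PackBitsByteAtATime(const std::vector<uint8_t>& valid,
                                std::vector<uint8_t>* bitmap) {
  assert(bitmap != nullptr);
  bitmap->assign((valid.size() + 7) / 8, 0);
  const std::size_t full_bytes = valid.size() / 8;
  for (std::size_t b = 0; b < full_bytes; ++b) {
    const uint8_t* v = valid.data() + b * 8;
    unsigned byte = 0;
    for (int k = 0; k < 8; ++k) {
      byte |= static_cast<unsigned>(v[k] != 0) << k;
    }
    // One store to the output buffer per eight values.
    (*bitmap)[b] = static_cast<uint8_t>(byte);
  }
  // Remaining values (fewer than eight) fall back to the bit-by-bit path.
  for (std::size_t i = full_bytes * 8; i < valid.size(); ++i) {
    if (valid[i]) {
      (*bitmap)[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    }
  }
}
```

Both functions produce the same bitmap for the same input. Whether the second is measurably faster depends on the compiler and the data, so it would need to be benchmarked rather than assumed.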
