On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <w...@cloudera.com> wrote:
> > As far as the existing work is concerned, I'm not sure everyone is aware of
> > the C++ code inside of Drill that can represent at least the scalar types in
> > Drill's existing Value Vectors [1]. This is currently used by the native
> > client written to hook up an ODBC driver.
>
> I have read this code. From my perspective, it would be less work to
> collaborate on a self-contained implementation that closely models the
> Arrow / VV spec that includes builder classes and its own memory
> management without coupling to Drill details. I started prototyping
> something here (warning: only a few actual days of coding here):
>
> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
>
> For example, you can see an example constructing an Array<Int32> or
> String (== Array<UInt8>) column in the tests here
>
> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
>
> I've been planning to use this as the basis of a C++ Parquet
> reader-writer and the associated Python pandas-like layer which
> includes in-memory analytics on Arrow data structures.
>
> > Parth who is included here has been the primary owner of this C++ code
> > throughout its life in Drill. Parth, what do you think is the best strategy
> > for managing the C++ code right now? As the C++ build is not tied into the
> > Java one, as I understand it we just run it manually when updates are made
> > there and we need to update ODBC. Would it be disruptive to move the code to
> > the arrow repo? If so, we could include Drill as a submodule in the new
> > repo, or put Wes's work so far in the Drill repo.
>
> If we can enumerate the non-Drill-client related parts (i.e. the array
> accessors and data structures-oriented code) that would make sense in
> a standalone Arrow library it would be great to start a side
> discussion about the design of the C++ reference implementation
> (metadata / schemas, IPC, array builders and accessors, etc.).
> Since this is quite urgent for me (intending to deliver a minimally viable
> pandas-like Arrow + Parquet in Python stack in the next ~3 months) it
> would be great to do this sooner rather than later.

Most of the code for Drill C++ Value Vectors is independent of Drill, mostly
the code up to line 787 in this file:

https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp

My thought was to leave the Drill implementation alone and borrow copiously
from it when convenient for Arrow. It seems like we can still do that while
building on Wes' work.

Wes, let me know if you want to have a quick hangout on this.

Parth
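
For readers unfamiliar with the "builder classes" mentioned in the quoted message, here is a minimal sketch of the general idea: a builder accumulates values in buffers it owns, then hands those buffers to an immutable array. Every name here (Int32Array, Int32Builder, StringArray, StringBuilder) is hypothetical and is not taken from arrow-cpp or from Drill's recordBatch.hpp. The string layout (an offsets buffer over a flat UInt8 buffer) is an assumption about what "String (== Array<UInt8>)" means, and one flag per slot stands in for a real validity bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the arrow-cpp or Drill API.

// Immutable fixed-width column: values plus one validity flag per slot.
class Int32Array {
 public:
  Int32Array(std::vector<int32_t> values, std::vector<bool> valid)
      : values_(std::move(values)), valid_(std::move(valid)) {}
  std::size_t length() const { return values_.size(); }
  bool IsNull(std::size_t i) const { assert(i < length()); return !valid_[i]; }
  int32_t Value(std::size_t i) const { assert(i < length()); return values_[i]; }

 private:
  std::vector<int32_t> values_;
  std::vector<bool> valid_;
};

// Accumulates values, then transfers its buffers to a finished array.
class Int32Builder {
 public:
  void Append(int32_t v) { values_.push_back(v); valid_.push_back(true); }
  // A null still occupies a slot so that positions stay aligned.
  void AppendNull() { values_.push_back(0); valid_.push_back(false); }
  Int32Array Finish() {
    Int32Array out(std::move(values_), std::move(valid_));
    values_.clear();  // leave the builder empty and reusable
    valid_.clear();
    return out;
  }

 private:
  std::vector<int32_t> values_;
  std::vector<bool> valid_;
};

// Variable-width column: element i is bytes_[offsets_[i] .. offsets_[i + 1]).
class StringArray {
 public:
  StringArray(std::vector<int32_t> offsets, std::vector<uint8_t> bytes)
      : offsets_(std::move(offsets)), bytes_(std::move(bytes)) {}
  std::size_t length() const { return offsets_.size() - 1; }
  std::string Value(std::size_t i) const {
    assert(i < length());
    return std::string(bytes_.begin() + offsets_[i],
                       bytes_.begin() + offsets_[i + 1]);
  }

 private:
  std::vector<int32_t> offsets_;
  std::vector<uint8_t> bytes_;
};

class StringBuilder {
 public:
  void Append(const std::string& s) {
    bytes_.insert(bytes_.end(), s.begin(), s.end());
    offsets_.push_back(static_cast<int32_t>(bytes_.size()));
  }
  StringArray Finish() {
    StringArray out(std::move(offsets_), std::move(bytes_));
    offsets_.assign(1, 0);  // reset to the single leading zero offset
    bytes_.clear();
    return out;
  }

 private:
  std::vector<int32_t> offsets_{0};  // always holds length + 1 entries
  std::vector<uint8_t> bytes_;
};
```

In this sketch a caller would call Append or AppendNull repeatedly and then Finish once. Keeping the builder and the finished array as separate types is one way to keep memory ownership inside the library, without depending on any Drill-specific allocator.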