+1 on a repo for the spec. I have questions as well, in particular about the metadata.
On Tue, Jan 12, 2016 at 6:59 PM, Wes McKinney <[email protected]> wrote:

> On Tue, Jan 12, 2016 at 6:21 PM, Parth Chandra <[email protected]> wrote:
> >
> > On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <[email protected]> wrote:
> >>
> >> > As far as the existing work is concerned, I'm not sure everyone is
> >> > aware of the C++ code inside of Drill that can represent at least the
> >> > scalar types in Drill's existing Value Vectors [1]. This is currently
> >> > used by the native client written to hook up an ODBC driver.
> >>
> >> I have read this code. From my perspective, it would be less work to
> >> collaborate on a self-contained implementation that closely models the
> >> Arrow / VV spec, one that includes builder classes and its own memory
> >> management without coupling to Drill details. I started prototyping
> >> something here (warning: only a few actual days of coding here):
> >>
> >> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
> >>
> >> For example, you can see an example of constructing an Array<Int32> or
> >> String (== Array<UInt8>) column in the tests here:
> >>
> >> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
> >>
> >> I've been planning to use this as the basis of a C++ Parquet
> >> reader-writer and the associated Python pandas-like layer, which
> >> includes in-memory analytics on Arrow data structures.
> >>
> >> > Parth, who is included here, has been the primary owner of this C++
> >> > code throughout its life in Drill. Parth, what do you think is the
> >> > best strategy for managing the C++ code right now? As the C++ build
> >> > is not tied into the Java one, as I understand it we just run it
> >> > manually when updates are made there and we need to update ODBC.
> >> > Would it be disruptive to move the code to the arrow repo?
> >> > If so, we could include Drill as a submodule in the new repo, or put
> >> > Wes's work so far in the Drill repo.
> >>
> >> If we can enumerate the non-Drill-client-related parts (i.e., the array
> >> accessors and data-structure-oriented code) that would make sense in a
> >> standalone Arrow library, it would be great to start a side discussion
> >> about the design of the C++ reference implementation (metadata /
> >> schemas, IPC, array builders and accessors, etc.). Since this is quite
> >> urgent for me (I intend to deliver a minimally viable pandas-like
> >> Arrow + Parquet stack in Python in the next ~3 months), it would be
> >> great to do this sooner rather than later.
> >>
> >
> > Most of the code for Drill C++ Value Vectors is independent of Drill -
> > mostly the code up to line 787 in this file:
> >
> > https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp
> >
> > My thought was to leave the Drill implementation alone and borrow
> > copiously from it when convenient for Arrow. It seems like we can still
> > do that building on Wes' work.
> >
>
> Makes sense. Speaking of code, would you all like me to set up a
> temporary repo for the specification itself? I already have a few
> questions, like how and where to track array null counts.
>
> > Wes, let me know if you want to have a quick hangout on this.
> >
>
> Sure, I'll follow up separately to get something on the calendar.
> Looking forward to connecting!
>
> > Parth
> >

--
Julien
