I'm available in SF on the 16th from 2pm onward -- location flexible. Perhaps we can run a hangout in case anyone outside of SF wants to listen to or participate in the discussion.
It's an exciting application, so I'm looking forward to finding a way to
collaborate and achieve common goals.

Thanks
Wes

On Sat, Aug 6, 2016 at 10:01 AM, Julien Le Dem <[email protected]> wrote:
> Hi Jim, Wes,
> I'd be happy to join on the 16th as well. I'm not far from the financial
> district. I can book a room there.
> Julien
>
> On Friday, August 5, 2016, Jim Pivarski <[email protected]> wrote:
>>
>> Hi Wes (and the Parquet team),
>>
>> I've just confirmed that I'm free from 1:45pm onward on Tuesday, August 16.
>> I'd love to talk with you and anyone else from the team about reading
>> physics data into Arrow and Parquet in C++. Let me know if there's a time
>> and place that works for you. I'll be coming from Ellis and Mason Street
>> near the Financial District.
>>
>> I started talking about this within my community and have already found a
>> dozen people who have been thinking along these lines: data exporters from
>> the ROOT file format to various Big Data and machine learning tools. We're
>> organizing ourselves to consolidate this effort. Three members of this
>> group are setting up a Spark cluster at CERN for centralized data analysis,
>> which is a big departure from how High Energy Physics has traditionally
>> been done (with private skims). Others are interested in machine learning
>> on the numerical Python stack.
>>
>> For clarification, we don't intend to specialize Parquet-C++ or Arrow-C++
>> for the physics use-case; I offered to contribute to the core software in
>> case it's incomplete in a way that prevents us from using it fully. I
>> thought that Logical Type Systems were one of the design goals for
>> Parquet-C++, the same way they're used in Parquet-Java to provide Parquet
>> files with enough metadata to adhere to Avro schemas. In our case, we have
>> ROOT StreamerInfo dictionaries that describe C++ objects; this could be
>> our Logical Type System on top of raw Parquet primitives.
>>
>> Also, I'm thinking more about going through Arrow, since some of our
>> use-cases involve reading ROOT data directly into Spark without
>> intermediate files. We might be able to minimize effort by converting ROOT
>> to Arrow, then using the existing Arrow-to-Parquet integration for files
>> and passing the Arrow data through the JNI to view it in Spark.
>>
>> Let me know if there's a time on the 16th that works for you.
>> Thanks!
>> -- Jim
>>
>> On Wed, Aug 3, 2016 at 12:58 PM, Wes McKinney <[email protected]> wrote:
>>
>> > hi Jim
>> >
>> > Cool to hear about this use case. My gut feeling is that we should not
>> > expand the scope of the parquet-cpp library itself too much beyond the
>> > computational details of constructing the encoded streams / metadata
>> > and writing to a file stream, or decoding a file into the raw values
>> > stored in each column.
>> >
>> > We could potentially create adapter code to convert between raw Parquet
>> > data (arrays of data page values, repetition, and definition levels)
>> > and Avro/Protobuf data structures.
>> >
>> > What we've done in Arrow, since we will need a generic IO subsystem
>> > for many tasks (for interacting with HDFS or other blob stores), is
>> > put all of this in leaf libraries in apache/arrow (see arrow::io and
>> > arrow::parquet namespaces). There isn't really the equivalent of a
>> > Boost for C++ Apache projects, so arrow::io seemed like a fine place
>> > to put them.
>> >
>> > I'm getting back to SF from an international trip on the 16th, but I
>> > can meet with you in the later part of the day, and anyone else is
>> > welcome to join the discussion.
>> >
>> > - Wes
>> >
>> > On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem <[email protected]> wrote:
>> > > Yes, that would be another way to do it.
>> > > The Parquet-cpp / parquet-arrow integration / arrow-cpp efforts are
>> > > closely related.
>> > > Julien
>> > >
>> > >> On Aug 3, 2016, at 9:41 AM, Jim Pivarski <[email protected]> wrote:
>> > >>
>> > >> Related question: could I get ROOT's complex events into Parquet
>> > >> files without inventing a Logical Type Definition by converting them
>> > >> to Apache Arrow data structures in memory, and then letting the
>> > >> Arrow-Parquet integration write those data structures to files?
>> > >>
>> > >> Arrow could provide side-benefits, such as sharing data between
>> > >> ROOT's C++ framework and JVM-based applications without intermediate
>> > >> files through the JNI. (Two birds with one stone.)
>> > >>
>> > >> -- Jim
>> > >
>>
>
> --
> Julien
>
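
For concreteness, here is a minimal sketch of the Arrow-to-Parquet handoff
discussed in this thread: an Arrow table is built in memory (standing in for
values decoded from ROOT) and passed to the Arrow/Parquet integration to do
the encoding and file writing. It uses the C++ API as it exists in current
Apache Arrow (parquet::arrow::WriteTable, arrow::io::FileOutputStream); at the
time of this thread the glue code lived under different namespaces, and the
column name, values, and function name below are hypothetical placeholders,
not anything from the thread.

    // Sketch only: build an Arrow table in memory (standing in for values
    // decoded from ROOT) and hand it to the Arrow-to-Parquet writer.
    #include <memory>
    #include <string>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>

    arrow::Status WriteEventsAsParquet(const std::string& path) {
      // Hypothetical placeholder: pretend these came from a ROOT TTree branch.
      std::vector<double> px = {1.1, 2.2, 3.3};

      // Build one Arrow column from the values.
      arrow::DoubleBuilder builder;
      ARROW_RETURN_NOT_OK(builder.AppendValues(px));
      std::shared_ptr<arrow::Array> px_array;
      ARROW_RETURN_NOT_OK(builder.Finish(&px_array));

      // Describe the schema and assemble an in-memory table.
      auto schema = arrow::schema({arrow::field("px", arrow::float64())});
      auto table = arrow::Table::Make(schema, {px_array});

      // Let the Arrow/Parquet integration handle encoding and file writing.
      ARROW_ASSIGN_OR_RAISE(auto outfile,
                            arrow::io::FileOutputStream::Open(path));
      return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                        outfile, /*chunk_size=*/1024);
    }

The ROOT-specific piece the thread proposes, decoding TTree branches (guided
by StreamerInfo dictionaries) into Arrow builders, would slot in where the
placeholder vector is filled.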
