Hi Jim, Wes,

I'd be happy to join on the 16th as well. I'm not far from the financial district. I can book a room there.

Julien

On Friday, August 5, 2016, Jim Pivarski wrote:

> Hi Wes (and the Parquet team),
>
> I've just confirmed that I'm free from 1:45pm onward on Tuesday August 16.
> I'd love to talk with you and anyone else from the team about reading
> physics data into Arrow and Parquet in C++. Let me know if there's a time
> and place that works for you. I'll be coming from Ellis and Mason Street
> near the Financial District.
>
> I started talking about this within my community, and have already found a
> dozen people who have been thinking along these lines: data exporters from
> the ROOT file format to various Big Data and machine learning tools. We're
> organizing ourselves to consolidate this effort. Three members of this
> group are setting up a Spark cluster at CERN for centralized data analysis,
> which is a big departure from how High Energy Physics has traditionally
> been done (with private skims). Others are interested in machine learning
> on the numerical Python stack.
>
> For clarification, we don't intend to specialize Parquet-C++ or Arrow-C++
> for the physics use-case; I offered to contribute to the core software in
> case it's incomplete in a way that prevents us from using it fully. I
> thought that Logical Type Systems were one of the design goals for
> Parquet-C++, the same way they're used in Parquet-Java to provide Parquet
> files with enough metadata to adhere to Avro schemas. In our case, we have
> ROOT StreamerInfo dictionaries that describe C++ objects; this could be our
> Logical Type System on top of raw Parquet primitives.
>
> Also, I'm thinking more about going through Arrow, since some of our
> use-cases involve reading ROOT data directly into Spark without
> intermediate files. We might be able to minimize effort by converting ROOT
> to Arrow and then use the existing Arrow to Parquet for files and pass the
> Arrow data through the JNI to view it in Spark.
>
> Let me know if there's a time on the 16th that works for you.
> Thanks!
> -- Jim
>
> On Wed, Aug 3, 2016 at 12:58 PM, Wes McKinney wrote:
>
> > hi Jim
> >
> > Cool to hear about this use case. My gut feeling is that we should not
> > expand the scope of the parquet-cpp library itself too much beyond the
> > computational details of constructing the encoded streams / metadata
> > and writing to a file stream or decoding a file into the raw values
> > stored in each column.
> >
> > We could potentially create adapter code to convert between Parquet
> > raw (arrays of data page values, repetition, and definition levels)
> > and Avro/Protobuf data structures.
> >
> > What we've done in Arrow, since we will need a generic IO subsystem
> > for many tasks (for interacting with HDFS or other blob stores), is
> > put all of this in leaf libraries in apache/arrow (see arrow::io and
> > arrow::parquet namespaces). There isn't really the equivalent of a
> > Boost for C++ Apache projects, so arrow::io seemed like a fine place
> > to put them.
> >
> > I'm getting back to SF from an international trip on the 16th but I
> > can meet with you in the later part of the day, and anyone else is
> > welcome to join to discuss.
> >
> > - Wes
> >
> > On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem wrote:
> > > Yes that would be another way to do it.
> > > The Parquet-cpp/parquet-arrow integration/arrow cpp efforts are closely
> > > related.
> > > Julien
> > >
> > >> On Aug 3, 2016, at 9:41 AM, Jim Pivarski wrote:
> > >>
> > >> Related question: could I get ROOT's complex events into Parquet files
> > >> without inventing a Logical Type Definition by converting them to Apache
> > >> Arrow data structures in memory, and then letting the Arrow-Parquet
> > >> integration write those data structures to files?
> > >>
> > >> Arrow could provide side-benefits, such as sharing data between ROOT's C++
> > >> framework and JVM-based applications without intermediate files through the
> > >> JNI. (Two birds with one stone.)
> > >>
> > >> -- Jim
> > >
> > > -- Julien
