I'll check

On Fri, Aug 12, 2016 at 12:37 PM, Deepak Majeti <[email protected]> wrote:
> Wes and Julien,
>
> If it is possible to run a hangout, I am interested in participating
> remotely.
> Thanks.
>
> On Fri, Aug 12, 2016 at 2:44 PM, Julien Le Dem <[email protected]> wrote:
> > I can book a room at WeWork Golden Gate in SF.
> > https://www.google.com/maps/place/WeWork+Golden+Gate/@37.7825083,-122.4132796,17z/data=!3m1!4b1!4m5!3m4!1s0x808580851a923a73:0x19e7d16aa0a92d68!8m2!3d37.7825041!4d-122.4110909
> > I'll send an invite.
> > Julien
> >
> > On Fri, Aug 12, 2016 at 6:18 AM, Jim Pivarski <[email protected]> wrote:
> > > Hi everyone,
> > >
> > > Is someone setting a location for us to meet on the 16th, at or after
> > > 2pm? If you have an office in the city, I could try meeting you there.
> > >
> > > I just had a meeting with the physicists interested in linking the high
> > > energy physics file format (ROOT) to Spark and Pandas. Most had been
> > > working independently, without knowing about each other's efforts, so it
> > > was an information exchange. One group was doing a conventional physics
> > > analysis, but on Scala/Spark rather than C++/ROOT, and another three (!)
> > > were independently interested in using Keras and Spark through Elephas,
> > > specifically to add new machine learning algorithms to their analyses.
> > > Two other groups were setting up Spark facilities (at Bristol and CERN),
> > > one was demonstrating machine learning with Pandas, and three of us
> > > (myself included) were working on generic data conversion software from
> > > ROOT to Numpy, R, Avro, and Scala. One asked me about Parquet
> > > integration specifically, and I told them about Arrow. Everyone thought
> > > a standardized ROOT-to-Arrow, Arrow-to-anything-else path would be a
> > > good way to consolidate effort.
> > >
> > > In addition, a few more names came up of people working on similar
> > > things, so I think I'm discovering an iceberg in the physics community.
> > > People are working alone right now, but as soon as they find out about
> > > each other, they'll be working together.
> > >
> > > I can bring to the meeting some materials that summarize how ROOT works
> > > and typical high energy physics analysis workflows, but I'm still
> > > assuming that it will be an informal conversation.
> > >
> > > See you next Tuesday,
> > >
> > > On Tue, Aug 9, 2016 at 5:50 PM, Wes McKinney <[email protected]> wrote:
> > > > I'm available in SF on the 16th from 2pm onward -- location flexible.
> > > > Perhaps we can run a hangout in case anyone outside of SF wants to
> > > > listen to or participate in the discussion.
> > > >
> > > > It's an exciting application, so I'm looking forward to finding a way
> > > > to collaborate and achieve common goals.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > On Sat, Aug 6, 2016 at 10:01 AM, Julien Le Dem <[email protected]> wrote:
> > > > > Hi Jim, Wes,
> > > > > I'd be happy to join on the 16th as well. I'm not far from the
> > > > > Financial District. I can book a room there.
> > > > > Julien
> > > > >
> > > > > On Friday, August 5, 2016, Jim Pivarski <[email protected]> wrote:
> > > > > > Hi Wes (and the Parquet team),
> > > > > >
> > > > > > I've just confirmed that I'm free from 1:45pm onward on Tuesday,
> > > > > > August 16. I'd love to talk with you and anyone else from the team
> > > > > > about reading physics data into Arrow and Parquet in C++. Let me
> > > > > > know if there's a time and place that works for you. I'll be
> > > > > > coming from Ellis and Mason Street near the Financial District.
> > > > > > I started talking about this within my community, and have
> > > > > > already found a dozen people who have been thinking along these
> > > > > > lines: data exporters from the ROOT file format to various Big
> > > > > > Data and machine learning tools. We're organizing ourselves to
> > > > > > consolidate this effort. Three members of this group are setting
> > > > > > up a Spark cluster at CERN for centralized data analysis, which
> > > > > > is a big departure from how High Energy Physics has traditionally
> > > > > > been done (with private skims). Others are interested in machine
> > > > > > learning on the numerical Python stack.
> > > > > >
> > > > > > For clarification, we don't intend to specialize Parquet-C++ or
> > > > > > Arrow-C++ for the physics use-case; I offered to contribute to
> > > > > > the core software in case it's incomplete in a way that prevents
> > > > > > us from using it fully. I thought that Logical Type Systems were
> > > > > > one of the design goals for Parquet-C++, the same way they're
> > > > > > used in Parquet-Java to provide Parquet files with enough
> > > > > > metadata to adhere to Avro schemas. In our case, we have ROOT
> > > > > > StreamerInfo dictionaries that describe C++ objects; this could
> > > > > > be our Logical Type System on top of raw Parquet primitives.
> > > > > >
> > > > > > Also, I'm thinking more about going through Arrow, since some of
> > > > > > our use-cases involve reading ROOT data directly into Spark
> > > > > > without intermediate files. We might be able to minimize effort
> > > > > > by converting ROOT to Arrow, then using the existing
> > > > > > Arrow-to-Parquet integration for files, and passing the Arrow
> > > > > > data through the JNI to view it in Spark.
> > > > > >
> > > > > > Let me know if there's a time on the 16th that works for you.
> > > > > > Thanks!
> > > > > > -- Jim
> > > > > >
> > > > > > On Wed, Aug 3, 2016 at 12:58 PM, Wes McKinney <[email protected]> wrote:
> > > > > > > hi Jim,
> > > > > > >
> > > > > > > Cool to hear about this use case. My gut feeling is that we
> > > > > > > should not expand the scope of the parquet-cpp library itself
> > > > > > > too much beyond the computational details of constructing the
> > > > > > > encoded streams / metadata and writing to a file stream, or
> > > > > > > decoding a file into the raw values stored in each column.
> > > > > > >
> > > > > > > We could potentially create adapter code to convert between
> > > > > > > Parquet raw (arrays of data page values, repetition, and
> > > > > > > definition levels) and Avro/Protobuf data structures.
> > > > > > >
> > > > > > > What we've done in Arrow, since we will need a generic IO
> > > > > > > subsystem for many tasks (for interacting with HDFS or other
> > > > > > > blob stores), is put all of this in leaf libraries in
> > > > > > > apache/arrow (see the arrow::io and arrow::parquet namespaces).
> > > > > > > There isn't really the equivalent of a Boost for C++ Apache
> > > > > > > projects, so arrow::io seemed like a fine place to put them.
> > > > > > >
> > > > > > > I'm getting back to SF from an international trip on the 16th,
> > > > > > > but I can meet with you in the later part of the day, and
> > > > > > > anyone else is welcome to join to discuss.
> > > > > > >
> > > > > > > - Wes
> > > > > > >
> > > > > > > On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem <[email protected]> wrote:
> > > > > > > > Yes, that would be another way to do it.
> > > > > > > > The Parquet-cpp / parquet-arrow integration / arrow cpp
> > > > > > > > efforts are closely related.
> > > > > > > > Julien
> > > > > > > >
> > > > > > > > On Aug 3, 2016, at 9:41 AM, Jim Pivarski <[email protected]> wrote:
> > > > > > > > > Related question: could I get ROOT's complex events into
> > > > > > > > > Parquet files without inventing a Logical Type Definition
> > > > > > > > > by converting them to Apache Arrow data structures in
> > > > > > > > > memory, and then letting the Arrow-Parquet integration
> > > > > > > > > write those data structures to files?
> > > > > > > > >
> > > > > > > > > Arrow could provide side-benefits, such as sharing data
> > > > > > > > > between ROOT's C++ framework and JVM-based applications
> > > > > > > > > without intermediate files through the JNI. (Two birds
> > > > > > > > > with one stone.)
> > > > > > > > >
> > > > > > > > > -- Jim
> > > > >
> > > > > --
> > > > > Julien
> >
> > --
> > Julien
>
> --
> regards,
> Deepak Majeti

--
Julien
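
P.S. For anyone who wants a concrete picture of the "ROOT -> Arrow -> existing
Arrow-to-Parquet writer" path Jim describes above, here is a rough C++ sketch.
It assumes the values have already been decoded from a ROOT TTree into flat
vectors; the pt/eta column names and that upstream decoding step are
hypothetical stand-ins, and real physics events would map to nested
list/struct types rather than flat doubles. Only the Arrow builder/Table calls
and parquet::arrow::WriteTable are meant to reflect the actual Arrow/Parquet
C++ APIs, and exact signatures may vary between releases.

    // Rough sketch: build an Arrow table from already-decoded values and let
    // the existing Arrow <-> Parquet integration write the file. The ROOT-side
    // decoding is assumed to have happened elsewhere (hypothetical).
    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/writer.h>

    #include <memory>
    #include <string>
    #include <vector>

    arrow::Status WriteEventsToParquet(const std::vector<double>& pt,   // hypothetical columns,
                                       const std::vector<double>& eta,  // decoded from ROOT upstream
                                       const std::string& path) {
      // Build one Arrow array per column.
      arrow::DoubleBuilder pt_builder, eta_builder;
      ARROW_RETURN_NOT_OK(pt_builder.AppendValues(pt));
      ARROW_RETURN_NOT_OK(eta_builder.AppendValues(eta));
      std::shared_ptr<arrow::Array> pt_array, eta_array;
      ARROW_RETURN_NOT_OK(pt_builder.Finish(&pt_array));
      ARROW_RETURN_NOT_OK(eta_builder.Finish(&eta_array));

      // Assemble a schema + table; nested ROOT objects would use list/struct
      // types here instead of flat float64 fields.
      auto schema = arrow::schema({arrow::field("pt", arrow::float64()),
                                   arrow::field("eta", arrow::float64())});
      auto table = arrow::Table::Make(schema, {pt_array, eta_array});

      // Hand the table to the Arrow-to-Parquet writer; it handles the column
      // encoding and file metadata.
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
      return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                        sink, /*chunk_size=*/64 * 1024);
    }

The same in-memory table is what could be handed to Spark through the JNI
instead of (or in addition to) writing a file, which is the "two birds with
one stone" point earlier in the thread.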
