I can book a room at WeWork Golden Gate in SF:
https://www.google.com/maps/place/WeWork+Golden+Gate/@37.7825083,-122.4132796,17z/data=!3m1!4b1!4m5!3m4!1s0x808580851a923a73:0x19e7d16aa0a92d68!8m2!3d37.7825041!4d-122.4110909

I'll send an invite.

Julien
On Fri, Aug 12, 2016 at 6:18 AM, Jim Pivarski <[email protected]> wrote:

> Hi everyone,
>
> Is someone setting a location for us to meet on the 16th, at or after 2pm?
> If you have an office in the city, I could try meeting you there.
>
> I just had a meeting with the physicists interested in linking the high
> energy physics file format (ROOT) to Spark and Pandas. Most had been
> working independently, without knowing about each other's efforts, so it
> was an information exchange. One group was doing a conventional physics
> analysis, but on Scala/Spark rather than C++/ROOT, and another three (!)
> were independently interested in using Keras and Spark through Elephas,
> specifically to add new machine learning algorithms to their analyses. Two
> other groups were setting up Spark facilities (at Bristol and CERN), one
> was demonstrating machine learning with Pandas, and three of us (myself
> included) were working on generic data conversion software from ROOT to
> Numpy, R, Avro, and Scala. One asked me about Parquet integration
> specifically, and I told them about Arrow. Everyone thought a standardized
> ROOT-to-Arrow, Arrow-to-anything-else path would be a good way to
> consolidate effort.
>
> In addition, a few more names came up of people working on similar things,
> so I think I'm discovering an iceberg in the physics community. People are
> working alone right now, but as soon as they find out about each other,
> they'll be working together.
>
> I can bring to the meeting some materials that summarize how ROOT works
> and typical high energy physics analysis workflows, but I'm still assuming
> that it will be an informal conversation.
>
> See you next Tuesday,
>
> On Tue, Aug 9, 2016 at 5:50 PM, Wes McKinney <[email protected]> wrote:
>
>> I'm available in SF on the 16th from 2pm onward -- location flexible.
>> Perhaps we can run a hangout in case anyone outside of SF wants to
>> listen to or participate in the discussion.
>>
>> It's an exciting application, so I'm looking forward to finding a way to
>> collaborate and achieve common goals.
>>
>> Thanks,
>> Wes
>>
>> On Sat, Aug 6, 2016 at 10:01 AM, Julien Le Dem <[email protected]> wrote:
>>
>>> Hi Jim, Wes,
>>> I'd be happy to join on the 16th as well. I'm not far from the Financial
>>> District. I can book a room there.
>>> Julien
>>>
>>> On Friday, August 5, 2016, Jim Pivarski <[email protected]> wrote:
>>>
>>>> Hi Wes (and the Parquet team),
>>>>
>>>> I've just confirmed that I'm free from 1:45pm onward on Tuesday, August 16.
>>>> I'd love to talk with you and anyone else from the team about reading
>>>> physics data into Arrow and Parquet in C++. Let me know if there's a time
>>>> and place that works for you. I'll be coming from Ellis and Mason Street
>>>> near the Financial District.
>>>>
>>>> I started talking about this within my community, and have already found
>>>> a dozen people who have been thinking along these lines: data exporters
>>>> from the ROOT file format to various Big Data and machine learning tools.
>>>> We're organizing ourselves to consolidate this effort. Three members of
>>>> this group are setting up a Spark cluster at CERN for centralized data
>>>> analysis, which is a big departure from how High Energy Physics has
>>>> traditionally been done (with private skims). Others are interested in
>>>> machine learning on the numerical Python stack.
>>>>
>>>> For clarification, we don't intend to specialize Parquet-C++ or Arrow-C++
>>>> for the physics use-case; I offered to contribute to the core software in
>>>> case it's incomplete in a way that prevents us from using it fully. I
>>>> thought that Logical Type Systems were one of the design goals for
>>>> Parquet-C++, the same way they're used in Parquet-Java to provide Parquet
>>>> files with enough metadata to adhere to Avro schemas. In our case, we have
>>>> ROOT StreamerInfo dictionaries that describe C++ objects; these could be
>>>> our Logical Type System on top of raw Parquet primitives.
>>>>
>>>> Also, I'm thinking more about going through Arrow, since some of our
>>>> use-cases involve reading ROOT data directly into Spark without
>>>> intermediate files. We might be able to minimize effort by converting
>>>> ROOT to Arrow, then using the existing Arrow-to-Parquet integration for
>>>> files and passing the Arrow data through the JNI to view it in Spark.
>>>>
>>>> Let me know if there's a time on the 16th that works for you.
>>>> Thanks!
>>>> -- Jim
>>>>
>>>> On Wed, Aug 3, 2016 at 12:58 PM, Wes McKinney <[email protected]> wrote:
>>>>
>>>>> hi Jim
>>>>>
>>>>> Cool to hear about this use case. My gut feeling is that we should not
>>>>> expand the scope of the parquet-cpp library itself too much beyond the
>>>>> computational details of constructing the encoded streams / metadata
>>>>> and writing to a file stream, or decoding a file into the raw values
>>>>> stored in each column.
>>>>>
>>>>> We could potentially create adapter code to convert between Parquet
>>>>> raw data (arrays of data page values, repetition, and definition levels)
>>>>> and Avro/Protobuf data structures.
>>>>>
>>>>> What we've done in Arrow, since we will need a generic IO subsystem
>>>>> for many tasks (for interacting with HDFS or other blob stores), is
>>>>> put all of this in leaf libraries in apache/arrow (see the arrow::io and
>>>>> arrow::parquet namespaces). There isn't really the equivalent of Boost
>>>>> for C++ Apache projects, so arrow::io seemed like a fine place to put
>>>>> them.
>>>>>
>>>>> I'm getting back to SF from an international trip on the 16th, but I
>>>>> can meet with you in the later part of the day, and anyone else is
>>>>> welcome to join the discussion.
>>>>>
>>>>> - Wes
>>>>>
>>>>> On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem <[email protected]> wrote:
>>>>>
>>>>>> Yes, that would be another way to do it.
>>>>>> The parquet-cpp, Parquet-Arrow integration, and Arrow C++ efforts are
>>>>>> closely related.
>>>>>> Julien
>>>>>>
>>>>>>> On Aug 3, 2016, at 9:41 AM, Jim Pivarski <[email protected]> wrote:
>>>>>>>
>>>>>>> Related question: could I get ROOT's complex events into Parquet files
>>>>>>> without inventing a Logical Type Definition by converting them to
>>>>>>> Apache Arrow data structures in memory, and then letting the
>>>>>>> Arrow-Parquet integration write those data structures to files?
>>>>>>>
>>>>>>> Arrow could provide side benefits, such as sharing data between ROOT's
>>>>>>> C++ framework and JVM-based applications without intermediate files,
>>>>>>> through the JNI. (Two birds with one stone.)
>>>>>>>
>>>>>>> -- Jim
>>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Julien
>>
>

--
Julien
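To make the "convert ROOT to Arrow, then let the existing Arrow/Parquet integration write the file" path discussed above concrete, here is a minimal sketch. It uses the present-day Arrow C++ and parquet-cpp APIs (arrow::*, parquet::arrow::WriteTable), not the 2016-era interfaces the thread refers to; the function name, the "Muon_pt" branch, and the values are hypothetical stand-ins for whatever a real ROOT-to-Arrow converter would produce.

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

#include <memory>
#include <string>

// Build a tiny Arrow table in memory (standing in for columns decoded from
// a ROOT TTree) and write it to Parquet via the Arrow/Parquet integration.
arrow::Status WriteEventsToParquet(const std::string& path) {
  // Hypothetical values for a ROOT branch such as "Muon_pt"; a real
  // converter would fill builders from the decompressed ROOT baskets.
  arrow::DoubleBuilder pt_builder;
  ARROW_RETURN_NOT_OK(pt_builder.AppendValues({25.3, 41.7, 13.9}));
  std::shared_ptr<arrow::Array> pt_array;
  ARROW_RETURN_NOT_OK(pt_builder.Finish(&pt_array));

  auto schema = arrow::schema({arrow::field("Muon_pt", arrow::float64())});
  auto table = arrow::Table::Make(schema, {pt_array});

  // Once the data is an arrow::Table, the existing Parquet writer does the
  // rest; no physics-specific logic lives in parquet-cpp itself.
  ARROW_ASSIGN_OR_RAISE(auto outfile, arrow::io::FileOutputStream::Open(path));
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                    outfile, /*chunk_size=*/1024);
}

The same in-memory arrow::Table is also what would be handed to Spark (via Arrow's IPC format rather than the JNI call sketched in the thread) when no intermediate file is wanted.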
