Wes and Julien, If it is possible to run a hangout, I am interested in participating remotely. Thanks.
On Fri, Aug 12, 2016 at 2:44 PM, Julien Le Dem <[email protected]> wrote:

> I can book a room at WeWork Golden Gate in SF.
> https://www.google.com/maps/place/WeWork+Golden+Gate/@37.7825083,-122.4132796,17z/data=!3m1!4b1!4m5!3m4!1s0x808580851a923a73:0x19e7d16aa0a92d68!8m2!3d37.7825041!4d-122.4110909
> I'll send an invite.
> Julien
>
> On Fri, Aug 12, 2016 at 6:18 AM, Jim Pivarski <[email protected]> wrote:
>
>> Hi everyone,
>>
>> Is someone setting a location for us to meet on the 16th, at or after 2pm?
>> If you have an office in the city, I could try meeting you there.
>>
>> I just had a meeting with the physicists interested in linking the high
>> energy physics file format (ROOT) to Spark and Pandas. Most had been
>> working independently, without knowing about each other's efforts, so it
>> was an information exchange. One group was doing a conventional physics
>> analysis, but on Scala/Spark rather than C++/ROOT, and another three (!)
>> were independently interested in using Keras and Spark through Elephas,
>> specifically to add new machine learning algorithms to their analyses. Two
>> other groups were setting up Spark facilities (at Bristol and CERN), one
>> was demonstrating machine learning with Pandas, and three of us (myself
>> included) were working on generic data conversion software from ROOT to
>> Numpy, R, Avro, and Scala. One asked me about Parquet integration
>> specifically, and I told them about Arrow. Everyone thought a standardized
>> ROOT-to-Arrow, Arrow-to-anything-else pipeline would be a good way to
>> consolidate effort.
>>
>> In addition, a few more names came up of people working on similar things,
>> so I think I'm discovering an iceberg in the physics community. People are
>> working alone right now, but as soon as they find out about each other,
>> they'll be working together.
>>
>> I can bring to the meeting some materials that summarize how ROOT works
>> and typical high energy physics analysis workflows, but I'm still assuming
>> that it will be an informal conversation.
>>
>> See you next Tuesday,
>>
>> On Tue, Aug 9, 2016 at 5:50 PM, Wes McKinney <[email protected]> wrote:
>>
>>> I'm available in SF on the 16th from 2pm onward -- location flexible.
>>> Perhaps we can run a hangout in case anyone outside of SF wants to
>>> listen to or participate in the discussion.
>>>
>>> It's an exciting application, so I'm looking forward to finding a way to
>>> collaborate and achieve common goals.
>>>
>>> Thanks
>>> Wes
>>>
>>> On Sat, Aug 6, 2016 at 10:01 AM, Julien Le Dem <[email protected]> wrote:
>>>
>>>> Hi Jim, Wes,
>>>> I'd be happy to join on the 16th as well. I'm not far from the Financial
>>>> District. I can book a room there.
>>>> Julien
>>>>
>>>> On Friday, August 5, 2016, Jim Pivarski <[email protected]> wrote:
>>>>
>>>>> Hi Wes (and the Parquet team),
>>>>>
>>>>> I've just confirmed that I'm free from 1:45pm onward on Tuesday, August
>>>>> 16. I'd love to talk with you and anyone else from the team about reading
>>>>> physics data into Arrow and Parquet in C++. Let me know if there's a time
>>>>> and place that works for you. I'll be coming from Ellis and Mason Street
>>>>> near the Financial District.
>>>>>
>>>>> I started talking about this within my community, and have already found
>>>>> a dozen people who have been thinking along these lines: data exporters
>>>>> from the ROOT file format to various Big Data and machine learning tools.
>>>>> We're organizing ourselves to consolidate this effort. Three members of
>>>>> this group are setting up a Spark cluster at CERN for centralized data
>>>>> analysis, which is a big departure from how High Energy Physics has
>>>>> traditionally been done (with private skims).
>>>>> Others are interested in
>>>>> machine learning on the numerical Python stack.
>>>>>
>>>>> For clarification, we don't intend to specialize Parquet-C++ or Arrow-C++
>>>>> for the physics use-case; I offered to contribute to the core software in
>>>>> case it's incomplete in a way that prevents us from using it fully. I
>>>>> thought that Logical Type Systems were one of the design goals for
>>>>> Parquet-C++, the same way they're used in Parquet-Java to provide Parquet
>>>>> files with enough metadata to adhere to Avro schemas. In our case, we
>>>>> have ROOT StreamerInfo dictionaries that describe C++ objects; this could
>>>>> be our Logical Type System on top of raw Parquet primitives.
>>>>>
>>>>> Also, I'm thinking more about going through Arrow, since some of our
>>>>> use-cases involve reading ROOT data directly into Spark without
>>>>> intermediate files. We might be able to minimize effort by converting
>>>>> ROOT to Arrow, then using the existing Arrow-to-Parquet integration for
>>>>> files, and passing the Arrow data through the JNI to view it in Spark.
>>>>>
>>>>> Let me know if there's a time on the 16th that works for you.
>>>>> Thanks!
>>>>> -- Jim
>>>>>
>>>>> On Wed, Aug 3, 2016 at 12:58 PM, Wes McKinney <[email protected]> wrote:
>>>>>
>>>>>> hi Jim
>>>>>>
>>>>>> Cool to hear about this use case. My gut feeling is that we should not
>>>>>> expand the scope of the parquet-cpp library itself too much beyond the
>>>>>> computational details of constructing the encoded streams / metadata
>>>>>> and writing to a file stream, or decoding a file into the raw values
>>>>>> stored in each column.
>>>>>>
>>>>>> We could potentially create adapter code to convert between raw Parquet
>>>>>> data (arrays of data page values, repetition, and definition levels)
>>>>>> and Avro/Protobuf data structures.
>>>>>>
>>>>>> What we've done in Arrow, since we will need a generic IO subsystem
>>>>>> for many tasks (for interacting with HDFS or other blob stores), is
>>>>>> put all of this in leaf libraries in apache/arrow (see the arrow::io
>>>>>> and arrow::parquet namespaces). There isn't really the equivalent of a
>>>>>> Boost for C++ Apache projects, so arrow::io seemed like a fine place
>>>>>> to put them.
>>>>>>
>>>>>> I'm getting back to SF from an international trip on the 16th but I
>>>>>> can meet with you in the later part of the day, and anyone else is
>>>>>> welcome to join to discuss.
>>>>>>
>>>>>> - Wes
>>>>>>
>>>>>> On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem <[email protected]> wrote:
>>>>>>
>>>>>>> Yes that would be another way to do it.
>>>>>>> The Parquet-cpp / parquet-arrow integration / arrow cpp efforts are
>>>>>>> closely related.
>>>>>>> Julien
>>>>>>>
>>>>>>> On Aug 3, 2016, at 9:41 AM, Jim Pivarski <[email protected]> wrote:
>>>>>>>
>>>>>>>> Related question: could I get ROOT's complex events into Parquet
>>>>>>>> files without inventing a Logical Type Definition by converting them
>>>>>>>> to Apache Arrow data structures in memory, and then letting the
>>>>>>>> Arrow-Parquet integration write those data structures to files?
>>>>>>>>
>>>>>>>> Arrow could provide side-benefits, such as sharing data between
>>>>>>>> ROOT's C++ framework and JVM-based applications without intermediate
>>>>>>>> files through the JNI. (Two birds with one stone.)
>>>>>>>>
>>>>>>>> -- Jim
>>>>
>>>> --
>>>> Julien
>
> --
> Julien

--
regards,
Deepak Majeti
