I'll check

On Fri, Aug 12, 2016 at 12:37 PM, Deepak Majeti <[email protected]> wrote:
> Wes and Julien,
>
> If it is possible to run a hangout, I am interested in participating
> remotely.
> Thanks.
>
> On Fri, Aug 12, 2016 at 2:44 PM, Julien Le Dem <[email protected]> wrote:
> > I can book a room at WeWork Golden Gate in SF.
> > https://www.google.com/maps/place/WeWork+Golden+Gate/@37.7825083,-122.4132796,17z/data=!3m1!4b1!4m5!3m4!1s0x808580851a923a73:0x19e7d16aa0a92d68!8m2!3d37.7825041!4d-122.4110909
> > I'll send an invite.
> > Julien
> >
> > On Fri, Aug 12, 2016 at 6:18 AM, Jim Pivarski <[email protected]> wrote:
> > > Hi everyone,
> > >
> > > Is someone setting a location for us to meet on the 16th, at or after
> > > 2pm? If you have an office in the city, I could try meeting you there.
> > >
> > > I just had a meeting with the physicists interested in linking the high
> > > energy physics file format (ROOT) to Spark and Pandas. Most had been
> > > working independently, without knowing about each other's efforts, so it
> > > was an information exchange. One group was doing a conventional physics
> > > analysis, but on Scala/Spark rather than C++/ROOT, and another three (!)
> > > were independently interested in using Keras and Spark through Elephas,
> > > specifically to add new machine learning algorithms to their analyses.
> > > Two other groups were setting up Spark facilities (at Bristol and CERN),
> > > one was demonstrating machine learning with Pandas, and three of us
> > > (myself included) were working on generic data conversion software from
> > > ROOT to Numpy, R, Avro, and Scala. One asked me about Parquet
> > > integration specifically, and I told them about Arrow. Everyone thought
> > > a standardized ROOT-to-Arrow, Arrow-to-anything-else path would be a
> > > good way to consolidate effort.
> > >
> > > In addition, a few more names came up of people working on similar
> > > things, so I think I'm discovering an iceberg in the physics community.
> > > People are working alone right now, but as soon as they find out about
> > > each other, they'll be working together.
> > >
> > > I can bring to the meeting some materials that summarize how ROOT works
> > > and typical high energy physics analysis workflows, but I'm still
> > > assuming that it will be an informal conversation.
> > >
> > > See you next Tuesday,
> > >
> > > On Tue, Aug 9, 2016 at 5:50 PM, Wes McKinney <[email protected]> wrote:
> > > > I'm available in SF on the 16th from 2pm onward -- location flexible.
> > > > Perhaps we can run a hangout in case anyone outside of SF wants to
> > > > listen to or participate in the discussion.
> > > >
> > > > It's an exciting application, so I'm looking forward to finding a way
> > > > to collaborate and achieve common goals.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > On Sat, Aug 6, 2016 at 10:01 AM, Julien Le Dem <[email protected]> wrote:
> > > > > Hi Jim, Wes,
> > > > > I'd be happy to join on the 16th as well. I'm not far from the
> > > > > Financial District. I can book a room there.
> > > > > Julien
> > > > >
> > > > > On Friday, August 5, 2016, Jim Pivarski <[email protected]> wrote:
> > > > > > Hi Wes (and the Parquet team),
> > > > > >
> > > > > > I've just confirmed that I'm free from 1:45pm onward on Tuesday,
> > > > > > August 16. I'd love to talk with you and anyone else from the team
> > > > > > about reading physics data into Arrow and Parquet in C++. Let me
> > > > > > know if there's a time and place that works for you. I'll be
> > > > > > coming from Ellis and Mason Street near the Financial District.
> > > > > > I started talking about this within my community, and have
> > > > > > already found a dozen people who have been thinking along these
> > > > > > lines: data exporters from the ROOT file format to various Big
> > > > > > Data and machine learning tools. We're organizing ourselves to
> > > > > > consolidate this effort. Three members of this group are setting
> > > > > > up a Spark cluster at CERN for centralized data analysis, which
> > > > > > is a big departure from how High Energy Physics has traditionally
> > > > > > been done (with private skims). Others are interested in machine
> > > > > > learning on the numerical Python stack.
> > > > > >
> > > > > > For clarification, we don't intend to specialize Parquet-C++ or
> > > > > > Arrow-C++ for the physics use-case; I offered to contribute to
> > > > > > the core software in case it's incomplete in a way that prevents
> > > > > > us from using it fully. I thought that Logical Type Systems were
> > > > > > one of the design goals for Parquet-C++, the same way they're
> > > > > > used in Parquet-Java to provide Parquet files with enough
> > > > > > metadata to adhere to Avro schemas. In our case, we have ROOT
> > > > > > StreamerInfo dictionaries that describe C++ objects; this could
> > > > > > be our Logical Type System on top of raw Parquet primitives.
> > > > > >
> > > > > > Also, I'm thinking more about going through Arrow, since some of
> > > > > > our use-cases involve reading ROOT data directly into Spark
> > > > > > without intermediate files. We might be able to minimize effort
> > > > > > by converting ROOT to Arrow, then using the existing
> > > > > > Arrow-to-Parquet integration for files, and passing the Arrow
> > > > > > data through the JNI to view it in Spark.
> > > > > >
> > > > > > Let me know if there's a time on the 16th that works for you.
> > > > > > Thanks!
> > > > > > -- Jim
> > > > > >
> > > > > > On Wed, Aug 3, 2016 at 12:58 PM, Wes McKinney <[email protected]> wrote:
> > > > > > > hi Jim,
> > > > > > >
> > > > > > > Cool to hear about this use case. My gut feeling is that we
> > > > > > > should not expand the scope of the parquet-cpp library itself
> > > > > > > too much beyond the computational details of constructing the
> > > > > > > encoded streams / metadata and writing to a file stream, or
> > > > > > > decoding a file into the raw values stored in each column.
> > > > > > >
> > > > > > > We could potentially create adapter code to convert between
> > > > > > > Parquet raw (arrays of data page values, repetition, and
> > > > > > > definition levels) and Avro/Protobuf data structures.
> > > > > > >
> > > > > > > What we've done in Arrow, since we will need a generic IO
> > > > > > > subsystem for many tasks (for interacting with HDFS or other
> > > > > > > blob stores), is put all of this in leaf libraries in
> > > > > > > apache/arrow (see the arrow::io and arrow::parquet namespaces).
> > > > > > > There isn't really the equivalent of a Boost for C++ Apache
> > > > > > > projects, so arrow::io seemed like a fine place to put them.
> > > > > > >
> > > > > > > I'm getting back to SF from an international trip on the 16th,
> > > > > > > but I can meet with you in the later part of the day, and
> > > > > > > anyone else is welcome to join to discuss.
> > > > > > >
> > > > > > > - Wes
> > > > > > >
> > > > > > > On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem <[email protected]> wrote:
> > > > > > > > Yes, that would be another way to do it.
> > > > > > > > The Parquet-cpp / parquet-arrow integration / arrow cpp
> > > > > > > > efforts are closely related.
> > > > > > > > Julien
> > > > > > > >
> > > > > > > > On Aug 3, 2016, at 9:41 AM, Jim Pivarski <[email protected]> wrote:
> > > > > > > > > Related question: could I get ROOT's complex events into
> > > > > > > > > Parquet files without inventing a Logical Type Definition
> > > > > > > > > by converting them to Apache Arrow data structures in
> > > > > > > > > memory, and then letting the Arrow-Parquet integration
> > > > > > > > > write those data structures to files?
> > > > > > > > >
> > > > > > > > > Arrow could provide side-benefits, such as sharing data
> > > > > > > > > between ROOT's C++ framework and JVM-based applications
> > > > > > > > > without intermediate files through the JNI. (Two birds
> > > > > > > > > with one stone.)
> > > > > > > > >
> > > > > > > > > -- Jim
> > > > >
> > > > > --
> > > > > Julien
> >
> > --
> > Julien
>
> --
> regards,
> Deepak Majeti

--
Julien
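
P.S. For anyone who wants a concrete picture of the "ROOT -> Arrow -> existing
Arrow-to-Parquet writer" path Jim describes above, here is a rough C++ sketch.
It assumes the values have already been decoded from a ROOT TTree into flat
vectors; the pt/eta column names and that upstream decoding step are
hypothetical stand-ins, and real physics events would map to nested
list/struct types rather than flat doubles. Only the Arrow builder/Table calls
and parquet::arrow::WriteTable are meant to reflect the actual Arrow/Parquet
C++ APIs, and exact signatures may vary between releases.

    // Rough sketch: build an Arrow table from already-decoded values and let
    // the existing Arrow <-> Parquet integration write the file. The ROOT-side
    // decoding is assumed to have happened elsewhere (hypothetical).
    #include <arrow/api.h>
    #include <arrow/io/api.h>
    #include <parquet/arrow/writer.h>

    #include <memory>
    #include <string>
    #include <vector>

    arrow::Status WriteEventsToParquet(const std::vector<double>& pt,   // hypothetical columns,
                                       const std::vector<double>& eta,  // decoded from ROOT upstream
                                       const std::string& path) {
      // Build one Arrow array per column.
      arrow::DoubleBuilder pt_builder, eta_builder;
      ARROW_RETURN_NOT_OK(pt_builder.AppendValues(pt));
      ARROW_RETURN_NOT_OK(eta_builder.AppendValues(eta));
      std::shared_ptr<arrow::Array> pt_array, eta_array;
      ARROW_RETURN_NOT_OK(pt_builder.Finish(&pt_array));
      ARROW_RETURN_NOT_OK(eta_builder.Finish(&eta_array));

      // Assemble a schema + table; nested ROOT objects would use list/struct
      // types here instead of flat float64 fields.
      auto schema = arrow::schema({arrow::field("pt", arrow::float64()),
                                   arrow::field("eta", arrow::float64())});
      auto table = arrow::Table::Make(schema, {pt_array, eta_array});

      // Hand the table to the Arrow-to-Parquet writer; it handles the column
      // encoding and file metadata.
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
      return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                        sink, /*chunk_size=*/64 * 1024);
    }

The same in-memory table is what could be handed to Spark through the JNI
instead of (or in addition to) writing a file, which is the "two birds with
one stone" point earlier in the thread.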
