Hi everyone,

Is someone setting a location for us to meet on the 16th, at or after 2pm? If you have an office in the city, I could try meeting you there.
I just had a meeting with the physicists interested in linking the high-energy physics file format (ROOT) to Spark and Pandas. Most had been working independently, without knowing about each other's efforts, so it was an information exchange. One group was doing a conventional physics analysis, but on Scala/Spark rather than C++/ROOT, and another three (!) were independently interested in using Keras and Spark through Elephas, specifically to add new machine learning algorithms to their analyses. Two other groups were setting up Spark facilities (at Bristol and CERN), one was demonstrating machine learning with Pandas, and three of us (myself included) were working on generic data-conversion software from ROOT to Numpy, R, Avro, and Scala. One asked me about Parquet integration specifically, and I told them about Arrow. Everyone thought that a standardized ROOT-to-Arrow step, followed by existing Arrow-to-anything-else converters, would be a good way to consolidate effort.

In addition, a few more names came up of people working on similar things, so I think I'm discovering an iceberg in the physics community. People are working alone right now, but as soon as they find out about each other, they'll be working together.

I can bring to the meeting some materials that summarize how ROOT works and typical high-energy physics analysis workflows, but I'm still assuming that it will be an informal conversation.

See you next Tuesday,

On Tue, Aug 9, 2016 at 5:50 PM, Wes McKinney <[email protected]> wrote:

> I'm available in SF on the 16th from 2pm onward -- location flexible.
> Perhaps we can run a hangout in case anyone outside of SF wants to
> listen to or participate in the discussion.
>
> It's an exciting application so looking forward to finding a way to
> collaborate and achieve common goals.
>
> Thanks
> Wes
>
> On Sat, Aug 6, 2016 at 10:01 AM, Julien Le Dem <[email protected]> wrote:
> > Hi Jim, Wes,
> > I'd be happy to join on the 16th as well. I'm not far from the financial
> > district.
> > I can book a room there.
> > Julien
> >
> > On Friday, August 5, 2016, Jim Pivarski <[email protected]> wrote:
> >>
> >> Hi Wes (and the Parquet team),
> >>
> >> I've just confirmed that I'm free from 1:45pm onward on Tuesday, August 16.
> >> I'd love to talk with you and anyone else from the team about reading
> >> physics data into Arrow and Parquet in C++. Let me know if there's a time
> >> and place that works for you. I'll be coming from Ellis and Mason Street,
> >> near the Financial District.
> >>
> >> I started talking about this within my community, and have already found a
> >> dozen people who have been thinking along these lines: data exporters from
> >> the ROOT file format to various Big Data and machine learning tools. We're
> >> organizing ourselves to consolidate this effort. Three members of this
> >> group are setting up a Spark cluster at CERN for centralized data analysis,
> >> which is a big departure from how high-energy physics has traditionally
> >> been done (with private skims). Others are interested in machine learning
> >> on the numerical Python stack.
> >>
> >> For clarification, we don't intend to specialize Parquet-C++ or Arrow-C++
> >> for the physics use case; I offered to contribute to the core software in
> >> case it's incomplete in a way that prevents us from using it fully. I
> >> thought that Logical Type Systems were one of the design goals for
> >> Parquet-C++, the same way they're used in Parquet-Java to provide Parquet
> >> files with enough metadata to adhere to Avro schemas. In our case, we have
> >> ROOT StreamerInfo dictionaries that describe C++ objects; this could be
> >> our Logical Type System on top of raw Parquet primitives.
> >>
> >> Also, I'm thinking more about going through Arrow, since some of our
> >> use cases involve reading ROOT data directly into Spark without
> >> intermediate files.
> >> We might be able to minimize effort by converting ROOT to Arrow, then
> >> using the existing Arrow-to-Parquet integration for files, and passing
> >> the Arrow data through the JNI to view it in Spark.
> >>
> >> Let me know if there's a time on the 16th that works for you.
> >> Thanks!
> >> -- Jim
> >>
> >> On Wed, Aug 3, 2016 at 12:58 PM, Wes McKinney <[email protected]> wrote:
> >> >
> >> > hi Jim
> >> >
> >> > Cool to hear about this use case. My gut feeling is that we should not
> >> > expand the scope of the parquet-cpp library itself too much beyond the
> >> > computational details of constructing the encoded streams / metadata
> >> > and writing to a file stream or decoding a file into the raw values
> >> > stored in each column.
> >> >
> >> > We could potentially create adapter code to convert between Parquet
> >> > raw (arrays of data page values, repetition, and definition levels)
> >> > and Avro/Protobuf data structures.
> >> >
> >> > What we've done in Arrow, since we will need a generic IO subsystem
> >> > for many tasks (for interacting with HDFS or other blob stores), is
> >> > put all of this in leaf libraries in apache/arrow (see the arrow::io and
> >> > arrow::parquet namespaces). There isn't really the equivalent of a
> >> > Boost for C++ Apache projects, so arrow::io seemed like a fine place
> >> > to put them.
> >> >
> >> > I'm getting back to SF from an international trip on the 16th, but I
> >> > can meet with you in the later part of the day, and anyone else is
> >> > welcome to join to discuss.
> >> >
> >> > - Wes
> >> >
> >> > On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem <[email protected]> wrote:
> >> > > Yes, that would be another way to do it.
> >> > > The parquet-cpp / parquet-arrow integration / arrow-cpp efforts are
> >> > > closely related.
> >> > > Julien
> >> > >
> >> > >> On Aug 3, 2016, at 9:41 AM, Jim Pivarski <[email protected]> wrote:
> >> > >>
> >> > >> Related question: could I get ROOT's complex events into Parquet
> >> > >> files without inventing a Logical Type Definition by converting them
> >> > >> to Apache Arrow data structures in memory, and then letting the
> >> > >> Arrow-Parquet integration write those data structures to files?
> >> > >>
> >> > >> Arrow could provide side benefits, such as sharing data between
> >> > >> ROOT's C++ framework and JVM-based applications without intermediate
> >> > >> files, through the JNI. (Two birds with one stone.)
> >> > >>
> >> > >> -- Jim
> >
> > --
> > Julien
