Hi Wes (and the Parquet team),

I've just confirmed that I'm free from 1:45pm onward on Tuesday, August 16.
I'd love to talk with you and anyone else from the team about reading
physics data into Arrow and Parquet in C++. Let me know if there's a time
and place that works for you. I'll be coming from Ellis and Mason Street
near the Financial District.

I started talking about this within my community and have already found a
dozen people who have been thinking along the same lines, writing data
exporters from the ROOT file format to various Big Data and machine
learning tools. We're
organizing ourselves to consolidate this effort. Three members of this
group are setting up a Spark cluster at CERN for centralized data analysis,
which is a big departure from how High Energy Physics has traditionally
been done (with private skims). Others are interested in machine learning
on the numerical Python stack.

For clarification, we don't intend to specialize Parquet-C++ or Arrow-C++
for the physics use-case; I offered to contribute to the core software in
case it's incomplete in a way that prevents us from using it fully. I
thought that Logical Type Systems were one of the design goals for
Parquet-C++, the same way they're used in Parquet-Java to provide Parquet
files with enough metadata to adhere to Avro schemas. In our case, we have
ROOT StreamerInfo dictionaries that describe C++ objects; this could be our
Logical Type System on top of raw Parquet primitives.
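To make that concrete, here's a rough sketch of how one C++ object described
by a StreamerInfo dictionary might map onto an annotated Parquet group
schema. The parquet-cpp schema calls and enum spellings below are my guess
from skimming the headers, not a verified API, so treat the names as
placeholders:

    // Hypothetical mapping of a ROOT-described C++ object, e.g.
    //   struct Muon { float pt; float eta; std::string label; };
    // onto a Parquet group schema with a converted-type annotation.
    // Class and enum names are assumptions, not a verified API.
    #include <memory>
    #include <parquet/schema.h>

    using parquet::Repetition;
    using parquet::Type;
    using parquet::ConvertedType;
    namespace schema = parquet::schema;

    std::shared_ptr<schema::GroupNode> MakeMuonSchema() {
      schema::NodeVector fields;
      fields.push_back(schema::PrimitiveNode::Make(
          "pt", Repetition::REQUIRED, Type::FLOAT));
      fields.push_back(schema::PrimitiveNode::Make(
          "eta", Repetition::REQUIRED, Type::FLOAT));
      // UTF8 is the "logical" annotation layered on the raw BYTE_ARRAY,
      // analogous to what Parquet-Java does for Avro string fields.
      fields.push_back(schema::PrimitiveNode::Make(
          "label", Repetition::OPTIONAL, Type::BYTE_ARRAY,
          ConvertedType::UTF8));
      return std::static_pointer_cast<schema::GroupNode>(
          schema::GroupNode::Make("Muon", Repetition::REQUIRED, fields));
    }

In the same spirit, the StreamerInfo dictionaries would generate these group
schemas mechanically rather than anyone writing them by hand.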

Also, I'm thinking more about going through Arrow, since some of our
use-cases involve reading ROOT data directly into Spark without
intermediate files. We might be able to minimize effort by converting ROOT
to Arrow, then using the existing Arrow-to-Parquet integration when we need
files, and passing the Arrow data through JNI to view it in Spark.
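As a sketch of the second half of that pipeline (the writer entry points are
my assumption based on the arrow::parquet code you mentioned and may not
match the current API exactly), building a small Arrow table and persisting
it to Parquet might look roughly like this:

    // Sketch only: builds one Arrow column and writes it to a Parquet file.
    // Function names follow my reading of the Arrow/Parquet C++ headers and
    // are assumptions, not verified against a specific release.
    #include <vector>
    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>

    arrow::Status WriteMuonPt(const std::vector<double>& pts) {
      // Build an in-memory Arrow column from the decoded ROOT branch values.
      arrow::DoubleBuilder builder;
      ARROW_RETURN_NOT_OK(builder.AppendValues(pts));
      std::shared_ptr<arrow::Array> pt_array;
      ARROW_RETURN_NOT_OK(builder.Finish(&pt_array));

      auto schema = arrow::schema({arrow::field("pt", arrow::float64())});
      auto table = arrow::Table::Make(schema, {pt_array});

      // The same in-memory table could instead be handed to the JVM via JNI;
      // writing to Parquet is only one of the two paths we care about.
      ARROW_ASSIGN_OR_RAISE(
          auto outfile,
          arrow::io::FileOutputStream::Open("muons.parquet"));
      return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                        outfile, /*chunk_size=*/1024);
    }

If that's roughly the right shape, the ROOT-specific work reduces to filling
Arrow builders from TTree branches, and everything downstream is shared code.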

Let me know if there's a time on the 16th that works for you.
Thanks!
-- Jim




On Wed, Aug 3, 2016 at 12:58 PM, Wes McKinney <[email protected]> wrote:

> hi Jim
>
> Cool to hear about this use case. My gut feeling is that we should not
> expand the scope of the parquet-cpp library itself too much beyond the
> computational details of constructing the encoded streams / metadata
> and writing to a file stream or decoding a file into the raw values
> stored in each column.
>
> We could potentially create adapter code to convert between Parquet
> raw (arrays of data page values, repetition, and definition levels)
> and Avro/Protobuf data structures.
>
> What we've done in Arrow, since we will need a generic IO subsystem
> for many tasks (for interacting with HDFS or other blob stores), is
> put all of this in leaf libraries in apache/arrow (see arrow::io and
> arrow::parquet namespaces). There isn't really the equivalent of a
> Boost for C++ Apache projects, so arrow::io seemed like a fine place
> to put them.
>
> I'm getting back to SF from an international trip on the 16th but I
> can meet with you in the later part of the day, and anyone else is
> welcome to join to discuss.
>
> - Wes
>
> On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem <[email protected]> wrote:
> > Yes that would be another way to do it.
> > The Parquet-cpp/parquet-arrow integration/arrow cpp efforts are closely
> > related.
> > Julien
> >
> >> On Aug 3, 2016, at 9:41 AM, Jim Pivarski <[email protected]> wrote:
> >>
> >> Related question: could I get ROOT's complex events into Parquet files
> >> without inventing a Logical Type Definition by converting them to Apache
> >> Arrow data structures in memory, and then letting the Arrow-Parquet
> >> integration write those data structures to files?
> >>
> >> Arrow could provide side-benefits, such as sharing data between ROOT's
> >> C++ framework and JVM-based applications without intermediate files
> >> through the JNI. (Two birds with one stone.)
> >>
> >> -- Jim
> >
>
