I'm available in SF on the 16th from 2pm onward -- location flexible. Perhaps we can run a hangout in case anyone outside of SF wants to listen to or participate in the discussion.
It's an exciting application, so I'm looking forward to finding a way to
collaborate and achieve common goals.

Thanks
Wes

On Sat, Aug 6, 2016 at 10:01 AM, Julien Le Dem <[email protected]> wrote:
> Hi Jim, Wes,
> I'd be happy to join on the 16th as well. I'm not far from the financial
> district. I can book a room there.
> Julien
>
> On Friday, August 5, 2016, Jim Pivarski <[email protected]> wrote:
>>
>> Hi Wes (and the Parquet team),
>>
>> I've just confirmed that I'm free from 1:45pm onward on Tuesday, August 16.
>> I'd love to talk with you and anyone else from the team about reading
>> physics data into Arrow and Parquet in C++. Let me know if there's a time
>> and place that works for you. I'll be coming from Ellis and Mason Street
>> near the Financial District.
>>
>> I started talking about this within my community and have already found a
>> dozen people who have been thinking along these lines: data exporters from
>> the ROOT file format to various Big Data and machine learning tools. We're
>> organizing ourselves to consolidate this effort. Three members of this
>> group are setting up a Spark cluster at CERN for centralized data analysis,
>> which is a big departure from how High Energy Physics has traditionally
>> been done (with private skims). Others are interested in machine learning
>> on the numerical Python stack.
>>
>> For clarification, we don't intend to specialize Parquet-C++ or Arrow-C++
>> for the physics use-case; I offered to contribute to the core software in
>> case it's incomplete in a way that prevents us from using it fully. I
>> thought that Logical Type Systems were one of the design goals for
>> Parquet-C++, the same way they're used in Parquet-Java to provide Parquet
>> files with enough metadata to adhere to Avro schemas. In our case, we have
>> ROOT StreamerInfo dictionaries that describe C++ objects; this could be
>> our Logical Type System on top of raw Parquet primitives.
>>
>> Also, I'm thinking more about going through Arrow, since some of our
>> use-cases involve reading ROOT data directly into Spark without
>> intermediate files. We might be able to minimize effort by converting ROOT
>> to Arrow, then using the existing Arrow-to-Parquet integration for files
>> and passing the Arrow data through the JNI to view it in Spark.
>>
>> Let me know if there's a time on the 16th that works for you.
>> Thanks!
>> -- Jim
>>
>> On Wed, Aug 3, 2016 at 12:58 PM, Wes McKinney <[email protected]> wrote:
>>
>> > hi Jim
>> >
>> > Cool to hear about this use case. My gut feeling is that we should not
>> > expand the scope of the parquet-cpp library itself too much beyond the
>> > computational details of constructing the encoded streams / metadata
>> > and writing to a file stream, or decoding a file into the raw values
>> > stored in each column.
>> >
>> > We could potentially create adapter code to convert between raw Parquet
>> > data (arrays of data page values, repetition, and definition levels)
>> > and Avro/Protobuf data structures.
>> >
>> > What we've done in Arrow, since we will need a generic IO subsystem
>> > for many tasks (for interacting with HDFS or other blob stores), is
>> > put all of this in leaf libraries in apache/arrow (see arrow::io and
>> > arrow::parquet namespaces). There isn't really the equivalent of a
>> > Boost for C++ Apache projects, so arrow::io seemed like a fine place
>> > to put them.
>> >
>> > I'm getting back to SF from an international trip on the 16th, but I
>> > can meet with you in the later part of the day, and anyone else is
>> > welcome to join the discussion.
>> >
>> > - Wes
>> >
>> > On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem <[email protected]> wrote:
>> > > Yes, that would be another way to do it.
>> > > The Parquet-cpp / parquet-arrow integration / arrow-cpp efforts are
>> > > closely related.
>> > > Julien
>> > >
>> > >> On Aug 3, 2016, at 9:41 AM, Jim Pivarski <[email protected]> wrote:
>> > >>
>> > >> Related question: could I get ROOT's complex events into Parquet
>> > >> files without inventing a Logical Type Definition by converting them
>> > >> to Apache Arrow data structures in memory, and then letting the
>> > >> Arrow-Parquet integration write those data structures to files?
>> > >>
>> > >> Arrow could provide side-benefits, such as sharing data between
>> > >> ROOT's C++ framework and JVM-based applications without intermediate
>> > >> files through the JNI. (Two birds with one stone.)
>> > >>
>> > >> -- Jim
>> > >
>>
>
> --
> Julien
>
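
For concreteness, here is a minimal sketch of the Arrow-to-Parquet handoff
discussed in this thread: an Arrow table is built in memory (standing in for
values decoded from ROOT) and passed to the Arrow/Parquet integration to do
the encoding and file writing. It uses the C++ API as it exists in current
Apache Arrow (parquet::arrow::WriteTable, arrow::io::FileOutputStream); at the
time of this thread the glue code lived under different namespaces, and the
column name, values, and function name below are hypothetical placeholders,
not anything from the thread.

    // Sketch only: build an Arrow table in memory (standing in for values
    // decoded from ROOT) and hand it to the Arrow-to-Parquet writer.
    #include <memory>
    #include <string>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>

    arrow::Status WriteEventsAsParquet(const std::string& path) {
      // Hypothetical placeholder: pretend these came from a ROOT TTree branch.
      std::vector<double> px = {1.1, 2.2, 3.3};

      // Build one Arrow column from the values.
      arrow::DoubleBuilder builder;
      ARROW_RETURN_NOT_OK(builder.AppendValues(px));
      std::shared_ptr<arrow::Array> px_array;
      ARROW_RETURN_NOT_OK(builder.Finish(&px_array));

      // Describe the schema and assemble an in-memory table.
      auto schema = arrow::schema({arrow::field("px", arrow::float64())});
      auto table = arrow::Table::Make(schema, {px_array});

      // Let the Arrow/Parquet integration handle encoding and file writing.
      ARROW_ASSIGN_OR_RAISE(auto outfile,
                            arrow::io::FileOutputStream::Open(path));
      return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                        outfile, /*chunk_size=*/1024);
    }

The ROOT-specific piece the thread proposes, decoding TTree branches (guided
by StreamerInfo dictionaries) into Arrow builders, would slot in where the
placeholder vector is filled.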
