hi Jim,

Cool to hear about this use case. My gut feeling is that we should not expand the scope of the parquet-cpp library itself much beyond the computational details of constructing the encoded streams / metadata and writing them to a file stream, or decoding a file into the raw values stored in each column.
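Concretely, the low-level surface I mean is the ColumnWriter-style path, where the caller hands parquet-cpp the values plus repetition/definition levels and the library only worries about encoding, metadata, and file layout. Something like the sketch below; the exact headers and signatures vary between parquet-cpp versions, so treat it as illustrative rather than exact:

// Hedged sketch of the low-level parquet-cpp write path: the caller supplies
// raw values plus definition (and, for nested data, repetition) levels, and
// the library handles encoding, metadata, and file layout. Header paths and
// exact signatures differ between releases, so adjust to your version.
#include <arrow/io/file.h>
#include <parquet/api/writer.h>
#include <parquet/exception.h>

#include <memory>
#include <vector>

int main() {
  using parquet::Repetition;
  using parquet::Type;
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  // One optional int64 column "x"; OPTIONAL gives a single definition level.
  auto schema = std::static_pointer_cast<GroupNode>(GroupNode::Make(
      "schema", Repetition::REQUIRED,
      {PrimitiveNode::Make("x", Repetition::OPTIONAL, Type::INT64)}));

  std::shared_ptr<arrow::io::FileOutputStream> sink;
  PARQUET_ASSIGN_OR_THROW(sink,
                          arrow::io::FileOutputStream::Open("example.parquet"));

  std::unique_ptr<parquet::ParquetFileWriter> writer =
      parquet::ParquetFileWriter::Open(sink, schema);
  parquet::RowGroupWriter* rg = writer->AppendRowGroup();

  // Four slots, one of them null: definition level 0 marks the null slot,
  // so only three values are supplied.
  std::vector<int16_t> def_levels = {1, 0, 1, 1};
  std::vector<int64_t> values = {1, 2, 3};

  auto* col = static_cast<parquet::Int64Writer*>(rg->NextColumn());
  col->WriteBatch(static_cast<int64_t>(def_levels.size()), def_levels.data(),
                  /*rep_levels=*/nullptr, values.data());

  writer->Close();
  return 0;
}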
We could potentially create adapter code to convert between the Parquet raw representation (arrays of data page values, repetition levels, and definition levels) and Avro/Protobuf data structures. What we've done in Arrow, since we will need a generic IO subsystem for many tasks (interacting with HDFS or other blob stores, for example), is put all of this in leaf libraries in apache/arrow (see the arrow::io and arrow::parquet namespaces). There isn't really an equivalent of Boost for C++ Apache projects, so arrow::io seemed like a fine place to put them; a rough sketch of the arrow::parquet write path is at the end of this message.

I'm getting back to SF from an international trip on the 16th, but I can meet with you in the later part of that day, and anyone else is welcome to join the discussion.

- Wes

On Wed, Aug 3, 2016 at 10:04 AM, Julien Le Dem <[email protected]> wrote:
> Yes, that would be another way to do it.
> The parquet-cpp / parquet-arrow integration / Arrow C++ efforts are closely related.
> Julien
>
>> On Aug 3, 2016, at 9:41 AM, Jim Pivarski <[email protected]> wrote:
>>
>> Related question: could I get ROOT's complex events into Parquet files
>> without inventing a Logical Type Definition by converting them to Apache
>> Arrow data structures in memory, and then letting the Arrow-Parquet
>> integration write those data structures to files?
>>
>> Arrow could provide side benefits, such as sharing data between ROOT's C++
>> framework and JVM-based applications without intermediate files, through
>> JNI. (Two birds with one stone.)
>>
>> -- Jim
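To make the Arrow path from the quoted question above concrete: build Arrow arrays and tables from the ROOT events in memory, then let the Parquet integration serialize the table. This is only a sketch; the integration's namespace and exact signatures have been in flux (arrow::parquet vs. parquet::arrow), so check the headers in your checkout:

// Hedged sketch of the Arrow-side write path: build Arrow arrays/tables in
// memory (e.g. converted from ROOT events), then hand the table to the
// Parquet integration to serialize. The integration namespace and signatures
// have moved around (arrow::parquet vs. parquet::arrow), so adjust to match
// your checkout; the "energy" column is a stand-in for real event data.
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

#include <memory>

arrow::Status WriteEvents() {
  // Stand-in for data converted out of ROOT: one float64 column "energy".
  arrow::DoubleBuilder builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({12.5, 7.25, 99.0}));
  std::shared_ptr<arrow::Array> energy;
  ARROW_RETURN_NOT_OK(builder.Finish(&energy));

  auto schema = arrow::schema({arrow::field("energy", arrow::float64())});
  auto table = arrow::Table::Make(schema, {energy});

  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open("events.parquet"));

  // WriteTable handles the Arrow -> Parquet conversion, chunked into row
  // groups of 1024 rows here, without the caller touching rep/def levels.
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                    sink, /*chunk_size=*/1024);
}

int main() {
  arrow::Status st = WriteEvents();
  return st.ok() ? 0 : 1;
}

The point of that path is that the caller never deals with repetition/definition levels directly; the mapping from nested Arrow structures to Parquet's columnar layout is handled by the integration layer.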
