Hello,

First off, I've been digging through as much of the code, docs, and
mailing lists as possible -- Parquet looks like a really interesting
and well-done project!

It looks like it would fit very nicely into a project I have, but I
have some questions.

I study high energy physics, sifting through what is originally
O(10PB) of data. After several rounds of reprocessing, I end up with
O(100M) events, which occupy roughly O(100GB) of storage. Each event
contains one list per type of physical object detected, and each
object holds several float32s, int32s, and booleans. Events recorded
around the same time are collected into groups of ~100-10k called
runs, and a separate list of run objects stores the floats/ints/bools
common to all of the events inside (there are actually three layers
of nesting, but I'm simplifying). A second program then reads through
each event, performs some really trivial math (e.g. multiplying
floats, or interpreting groups of four floats as a vector and doing
vector arithmetic), and combines the per-event results into a handful
of distributions.

The problem is that this second program takes forever to deserialize
from our experiment's native format. A no-op that just reads each of
the values and does nothing else can only process a few hundred
events per second, so even with a thousand or so cores a pass takes
the better part of a night. Put another way, these jobs running on
our site pull, on average, 250 kB/s out of our storage.

I'll shortly have some time and access to a dedicated test cluster to
try different things out (the initial plan is to stand up Hadoop, but
that may change), and from what I've read, it seems like I could put
together something that uses Parquet as the I/O layer in place of our
native format (our application is C++-based, FWIW). Then, since there
are a number of adaptors in parquet-mr for different MR frameworks, I
could continue by adapting more of our tools.

Sorry for the long-winded introduction; here are the questions I
couldn't answer from my reading.

1) Is it possible to have two disjoint lists in the same file (e.g.
events and runs)? Looking at the format, it seemed like each row
group might be able to carry a different set of columns, but I got a
little lost walking the read path trying to see whether the library
would support that. A sketch of the fallback I'm imagining follows.
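
If two truly disjoint lists aren't possible, the fallback I picture
is a single file-level schema with both record types as optional
groups, where each record populates exactly one of them. This is just
a sketch against parquet-cpp's schema-building API as I understand it
from schema.h, with made-up field names, so please correct me if I've
misread the calls:

    #include "parquet/schema.h"

    using parquet::Repetition;
    using parquet::Type;
    using parquet::schema::GroupNode;
    using parquet::schema::NodePtr;
    using parquet::schema::PrimitiveNode;

    // One schema holding both record types; a "run" record leaves the
    // "event" group unset and vice versa. Field names are invented.
    NodePtr MakeCombinedSchema() {
      NodePtr run = GroupNode::Make(
          "run", Repetition::OPTIONAL,
          {PrimitiveNode::Make("run_number", Repetition::REQUIRED,
                               Type::INT32),
           PrimitiveNode::Make("beam_energy", Repetition::REQUIRED,
                               Type::FLOAT)});
      NodePtr event = GroupNode::Make(
          "event", Repetition::OPTIONAL,
          {PrimitiveNode::Make("run_number", Repetition::REQUIRED,
                               Type::INT32),
           GroupNode::Make(
               "muons", Repetition::REPEATED,
               {PrimitiveNode::Make("pt", Repetition::REQUIRED,
                                    Type::FLOAT)})});
      return GroupNode::Make("schema", Repetition::REQUIRED,
                             {run, event});
    }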

2) Is the C++ API intended to eventually be thread-safe under
concurrent access, or do I need to synchronize manually? With our
existing code, we open a file once and then have N threads iterate
over the events inside; a rough sketch follows.
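
For concreteness, this is the pattern we'd like to port, written
against the parquet-cpp reader API as best I understand it (so treat
these calls as a guess). Whether sharing one ParquetFileReader across
threads like this is, or will be, safe is exactly what I'm asking:

    #include <memory>
    #include <string>
    #include <thread>
    #include <vector>

    #include "parquet/api/reader.h"

    void ScanFile(const std::string& path, int num_threads) {
      std::unique_ptr<parquet::ParquetFileReader> reader =
          parquet::ParquetFileReader::OpenFile(path);
      const int num_row_groups = reader->metadata()->num_row_groups();

      std::vector<std::thread> workers;
      for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&reader, t, num_threads, num_row_groups] {
          // Each thread walks a disjoint slice of row groups off the
          // shared reader.
          for (int rg = t; rg < num_row_groups; rg += num_threads) {
            auto floats = std::static_pointer_cast<parquet::FloatReader>(
                reader->RowGroup(rg)->Column(0));  // hypothetical column
            float values[1024];
            int64_t values_read = 0;
            while (floats->HasNext()) {
              // nullptr def/rep levels assume a required, non-repeated
              // column
              floats->ReadBatch(1024, nullptr, nullptr, values,
                                &values_read);
              // ... the trivial per-event math would go here ...
            }
          }
        });
      }
      for (auto& w : workers) w.join();
    }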

3) If I wanted to modify the file I/O behavior to make better use of
our mass-storage systems, would subclassing
RandomAccessSource/OutputStream be considered acceptable? Will those
interfaces remain reasonably stable? An illustration of the kind of
adaptor I mean follows.
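
I'm paraphrasing the interface from memory, so I've written a
stand-in base class rather than the real one, and XrootdSource is a
made-up name; the point is only the subclassing pattern:

    #include <cstdint>
    #include <string>

    // Stand-in for parquet's RandomAccessSource -- the real interface
    // lives in parquet-cpp and almost certainly differs in its exact
    // signatures.
    class RandomAccessSource {
     public:
      virtual ~RandomAccessSource() = default;
      virtual int64_t Size() const = 0;
      virtual int64_t ReadAt(int64_t position, int64_t nbytes,
                             uint8_t* out) = 0;
      virtual void Close() = 0;
    };

    // Hypothetical adaptor that serves reads from our mass-storage
    // system (e.g. via xrootd) instead of the local filesystem.
    class XrootdSource : public RandomAccessSource {
     public:
      explicit XrootdSource(std::string url) : url_(std::move(url)) {
        // open the remote file and cache its size
      }
      int64_t Size() const override { return size_; }
      int64_t ReadAt(int64_t position, int64_t nbytes,
                     uint8_t* out) override {
        // issue one positioned remote read; batching/readahead would
        // go here, since large requests are what our storage likes
        return RemoteRead(position, nbytes, out);
      }
      void Close() override { /* release the remote handle */ }

     private:
      int64_t RemoteRead(int64_t /*pos*/, int64_t /*n*/,
                         uint8_t* /*out*/) {
        return 0;  // placeholder
      }
      std::string url_;
      int64_t size_ = 0;
    };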

Thanks for your time. I'm working on compiling parquet-cpp so I can
implement a proof of concept (though I'll need to install a Java
environment to actually produce the input files). If the results are
promising, I'll hopefully be able to pick this project up for a good
while.

Cheers!
Andrew

--
Andrew Melo