Hi Andrew, I can make some specific comments about parquet-cpp. Note that
it's still very much at an alpha stage of development, so you may need to
submit some patches for your needs, but such is the price of progress,
right? =) On the bright side, a number of us here on the list need
Parquet's C++ library to reach a beta / suitable-for-production stage
(even with some ongoing API instability) in the next few months, so for
bug fixes and ongoing maintenance you won't be completely on your own.
More responses inline.

On Sat, Mar 12, 2016 at 2:50 PM, Andrew Melo <[email protected]> wrote:

> Hello,
>
> First off, I've been digging through as much of the code, docs and
> mailing lists as possible -- parquet looks like a really interesting
> and well-done project!
>
> It appears that it would fit very nicely into a project I have, but I
> had some questions.
>
> I study high energy physics, sifting through what is originally
> O(10PB) of data. After several rounds of reprocessing, I end up with
> O(100M) events, which occupy roughly O(100GB) of storage. Each event
> contains a list for each different type of physical object that was
> detected, and each object contains several float32s, int32s, and
> booleans. Groups of ~100-10k events that were recorded around the same
> time are collected into "runs", and there's a separate list of run
> objects which store floats/ints/bools that are common to all of the
> events inside (there's actually three layers of nesting, but I'm
> simplifying). I then have a second program that reads through each
> event, performs some really trivial math (ex: multiplying floats, or
> possibly interpreting groups of four floats as a vector and performing
> vector arithmetic), and combines the result from each event to get a
> handful of distributions.
>
> The problem is that this second program takes forever to deserialize
> from our experiment's native format. A no-op that just reads each of
> the values and does nothing else can only process a few hundred
> events per second, so even a thousand or so cores take the better
> part of a night to run. Put another way, these jobs running on our
> site pull, on average, 250kB/sec out of our storage.
>
> I'll shortly have some time and access to a dedicated test cluster to
> try different things out (the initial plan is to roll hadoop, but that
> may change), and from what I've read, it seems like I could put
> together something that uses parquet as the i/o layer (our
> application is c++-based, FWIW). Then, seeing as there's a number of
> adaptors in parquet-mr to different MR frameworks, I could continue by
> adapting some of our tools further.
>
> Sorry for the long-winded introduction; here are the questions I
> couldn't answer after reading.
>
> 1) Is it possible to have two disjoint lists in the same file (e.g.
> events, runs)? Looking at the format, each rowgroup can have different
> columns, but I got a little lost walking the read path to see if the
> library would support that.

You should be able to read each row group independently, so this should
not be an issue (there's a small sketch of this below). The row groups
do all share the same file source, though; more on that next.

> 2) Is the C++ API intended to eventually be thread-safe from
> concurrent accesses, or do I need to synchronize manually? With the
> existing code, we open a file once then have N threads iterate over
> the events inside.

When a column chunk is read from the file, it is pulled into "memory"
here:

https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader-internal.cc#L172

I put "memory" in quotation marks since it depends on whether you are
using the memory-mapped file source or not (it is on by default).
Eventually, you hit MemoryMapSource::Read:

https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/input.cc#L170

As you can see, this touches some member variables that are not
protected by mutexes or any other locks, so it is currently not
thread-safe. There is a JIRA open for this:

https://issues.apache.org/jira/browse/PARQUET-474

Personally, I'd prefer to keep synchronization / locking of any shared
resources internal to the library. Until that lands, you'll need to
synchronize manually; the second sketch below shows the simplest
workaround.
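To make question 1 concrete, reading one row group in isolation looks
roughly like the following. This is untested and written from memory,
so verify the names and signatures (and the namespace) against
parquet/file/reader.h in your checkout:

    // Untested sketch; verify names against parquet/file/reader.h.
    #include <cstdint>
    #include <memory>
    #include <string>
    #include "parquet/file/reader.h"

    void ScanOneRowGroup(const std::string& path, int row_group, int column) {
      // Memory-mapped by default, as mentioned above
      std::unique_ptr<parquet::ParquetFileReader> reader =
          parquet::ParquetFileReader::OpenFile(path);

      // Row groups are self-contained, so you can go straight to one
      auto rg = reader->RowGroup(row_group);
      auto col = rg->Column(column);

      // Assuming an INT32 column for illustration
      auto* int_reader = static_cast<parquet::Int32Reader*>(col.get());

      int32_t values[1024];
      int64_t values_read = 0;
      while (int_reader->HasNext()) {
        int_reader->ReadBatch(1024, nullptr, nullptr, values, &values_read);
        // ... your per-event math over values[0..values_read) ...
      }
    }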
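And for question 2, until PARQUET-474 is done the simplest manual
workaround is to avoid sharing: give each thread its own reader over the
same file, e.g. one per row group. A sketch, with the same caveats about
names/signatures as above:

    // One reader per thread: no shared mutable state to protect. You pay
    // for N opens, but the memory-mapped source keeps that fairly cheap.
    #include <string>
    #include <thread>
    #include <vector>
    #include "parquet/file/reader.h"

    void ParallelScan(const std::string& path, int num_row_groups) {
      std::vector<std::thread> workers;
      for (int rg = 0; rg < num_row_groups; ++rg) {
        workers.emplace_back([path, rg]() {
          // Private reader instance; nothing here is shared across threads
          auto reader = parquet::ParquetFileReader::OpenFile(path);
          auto row_group = reader->RowGroup(rg);
          // ... scan this row group's columns as in the sketch above ...
        });
      }
      for (auto& t : workers) t.join();
    }

If you'd rather share a single reader, you'd have to serialize every
call into it behind one std::mutex, which defeats most of the
parallelism.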
For scanning column chunks, multi-threading is more use-case dependent.
I would like to enable scanning Parquet columns in a pipelined context,
meaning that you should have a set of APIs that let you ask for the next
batch of values and repetition/definition levels to be decoded while you
are still processing the last batch. This is not possible for all data
types right now (only the primitive types, not BYTE_ARRAY /
FIXED_LEN_BYTE_ARRAY) because there is memory shared between decode
batches. So we'd need to return data batches along with a memory buffer
(for BYTE_ARRAY data, for example) that you are responsible for
destructing (perhaps by letting the returned unique_ptr / shared_ptr
fall out of scope). There's a rough sketch of what I mean at the bottom
of this mail; I can go into more detail if it isn't clear.

> 3) If I were to want to modify the file i/o behavior to better use our
> mass storage systems, would subclassing
> RandomAccessSource/OutputStream be considered acceptable? Will their
> interfaces remain reasonably stable?

Absolutely (see the last sketch at the bottom for the general shape).
Any changes that can go directly into parquet-cpp would be ideal
(outside of changes specific to your systems, of course), and if you
need to evolve the API in some way, please let us know.

> Thanks for your time. I'm working on compiling parquet-cpp so I can
> implement a proof-of-concept (though I'll need to install a java
> environment to actually produce the input files). If the results are
> promising, I'll hopefully be able to pick this project up for the next
> while.

Cool, let us know if you run into any problems (with compilation or
otherwise). We haven't done any code optimization work or profiling
yet, so if performance is an issue I am sure there is some low-hanging
fruit.

- Wes

> Cheers!
> Andrew
>
> --
> Andrew Melo
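P.S. Here is the rough sketch of the batch-decode API I have in mind
above. Nothing like this exists in parquet-cpp today; every name here
is made up:

    // Hypothetical API shape, not current parquet-cpp. The point: each
    // batch owns the memory backing its values (which matters for
    // BYTE_ARRAY / FIXED_LEN_BYTE_ARRAY), so batch N+1 can be decoded
    // while you are still processing batch N, and each batch is freed
    // by letting it fall out of scope.
    #include <cstdint>
    #include <memory>
    #include <vector>

    struct DecodedBatch {
      std::vector<int16_t> def_levels;
      std::vector<int16_t> rep_levels;
      const uint8_t* values;   // points into `buffer` below
      int64_t num_values;
      // Keeps the decoded bytes alive for as long as the batch exists
      std::shared_ptr<std::vector<uint8_t>> buffer;
    };

    class ColumnBatchScanner {  // hypothetical
     public:
      // Returns nullptr once the column chunk is exhausted. Internally
      // this could kick off decoding of the next batch before returning.
      std::unique_ptr<DecodedBatch> NextBatch(int64_t batch_size);
    };

With something like this, the thread-safety question mostly goes away
for the scan itself: each thread processes batches it owns, and only
the scanner's internals need a lock.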
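And for question 3, subclassing the file source. I'm writing the
virtual methods from memory, so match them against whatever
parquet/util/input.h actually declares in your checkout (the namespace
may also differ):

    // Sketch only: method names/signatures recalled from memory of
    // parquet/util/input.h, so verify against the header before using.
    #include <cstdint>
    #include "parquet/util/input.h"

    class MassStorageSource : public parquet::RandomAccessSource {
     public:
      void Close() override { /* release handles into your storage system */ }
      void Seek(int64_t pos) override { pos_ = pos; }
      int64_t Tell() override { return pos_; }
      int64_t Read(int64_t nbytes, uint8_t* out) override {
        // Issue a positioned read of nbytes at pos_ against your mass
        // storage client, copy into out, advance pos_, and return the
        // number of bytes actually read.
        return 0;  // placeholder
      }
      // ...plus whatever other virtuals the header requires...
     private:
      int64_t pos_ = 0;
    };

If the plumbing to open a ParquetFileReader from an arbitrary source
(rather than a local path) isn't exposed yet, that's exactly the kind
of patch we'd welcome.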
