Hello,

First off, I've been digging through as much of the code, docs, and mailing lists as possible -- Parquet looks like a really interesting and well-done project!
It looks like it would fit very nicely into a project I have, but I had some questions. I study high energy physics, sifting through what is originally O(10 PB) of data. After several rounds of reprocessing, I end up with O(100M) events, which occupy roughly O(100 GB) of storage. Each event contains a list for each different type of physical object that was detected, and each object contains several float32s, int32s, and booleans. Groups of ~100-10k events that were recorded around the same time are collected into "runs", and there's a separate list of run objects which store the floats/ints/bools that are common to all of the events inside (there are actually three layers of nesting, but I'm simplifying).

I then have a second program that reads through each event, performs some really trivial math (e.g. multiplying floats, or interpreting groups of four floats as a vector and performing vector arithmetic), and combines the result from each event to get a handful of distributions. The problem is that this second program takes forever to deserialize our experiment's native format. A no-op that just reads each of the values and does nothing else can only process a few hundred events per second, so even a thousand or so cores take the better part of a night to run. Put another way, these jobs running on our site pull, on average, 250 kB/sec out of our storage.

I'll shortly have some time and access to a dedicated test cluster to try different things out (the initial plan is to roll out Hadoop, but that may change), and from what I've read, it seems like I could put together something that uses Parquet as the I/O layer in place of our native format (our application is C++-based, FWIW). Then, seeing as parquet-mr has adaptors for a number of different MR frameworks, I could continue by adapting some of our tools further. Sorry for the long-winded introduction; here are the questions I couldn't answer after reading:

1) Is it possible to have two disjoint lists in the same file (e.g. events and runs)? Looking at the format, each row group can have different columns, but I got a little lost walking the read path trying to see if the library would support that.

2) Is the C++ API intended to eventually be thread-safe under concurrent access, or do I need to synchronize manually? With the existing code, we open a file once and then have N threads iterate over the events inside.

3) If I were to modify the file I/O behavior to better use our mass storage systems, would subclassing RandomAccessSource/OutputStream be considered acceptable? Will their interfaces remain reasonably stable?

Thanks for your time. I'm working on compiling parquet-cpp so I can implement a proof of concept (though I'll need to install a Java environment to actually produce the input files). If the results are promising, I'll hopefully be able to pick this project up for the next while. To make the questions more concrete, I've pasted a few rough sketches below my signature.

Cheers!
Andrew

--
Andrew Melo
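P.S. Here's roughly how I picture a (heavily simplified) event schema, built with parquet-cpp's schema API. The field names (muons, pt, run_number, ...) are invented for illustration, and I'm still feeling my way around the headers, so treat it as a sketch:

#include <iostream>
#include "parquet/schema.h"

using parquet::Repetition;
using parquet::Type;
using parquet::schema::GroupNode;
using parquet::schema::NodePtr;
using parquet::schema::PrimitiveNode;

int main() {
  // One repeated group per kind of detected object (names invented).
  NodePtr muons = GroupNode::Make(
      "muons", Repetition::REPEATED,
      {PrimitiveNode::Make("pt", Repetition::REQUIRED, Type::FLOAT),
       PrimitiveNode::Make("eta", Repetition::REQUIRED, Type::FLOAT),
       PrimitiveNode::Make("phi", Repetition::REQUIRED, Type::FLOAT),
       PrimitiveNode::Make("charge", Repetition::REQUIRED, Type::INT32),
       PrimitiveNode::Make("isolated", Repetition::REQUIRED, Type::BOOLEAN)});

  // Run-level values shared by every event in the run.
  NodePtr run = GroupNode::Make(
      "run", Repetition::REQUIRED,
      {PrimitiveNode::Make("run_number", Repetition::REQUIRED, Type::INT32),
       PrimitiveNode::Make("beam_energy", Repetition::REQUIRED, Type::FLOAT)});

  NodePtr schema = GroupNode::Make("event", Repetition::REQUIRED, {muons, run});

  // I believe there's a PrintSchema helper, though I may have its header wrong.
  parquet::schema::PrintSchema(schema.get(), std::cout);
  return 0;
}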
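For question 2, this is the access pattern I'd like to end up with: open the file once, then have N threads each walk a disjoint set of row groups. The calls below (ParquetFileReader::OpenFile, RowGroup(), FloatReader::ReadBatch) are my reading of the public headers, and the column index, file name, and per-event math are placeholders:

#include <cstdint>
#include <memory>
#include <thread>
#include <vector>
#include "parquet/api/reader.h"

// Each worker handles every step-th row group. Column 0 is assumed to be a
// required, non-nested float column, so the level arrays can be null.
void Worker(parquet::ParquetFileReader* reader, int first_rg, int step) {
  int num_row_groups = reader->metadata()->num_row_groups();
  for (int rg = first_rg; rg < num_row_groups; rg += step) {
    auto row_group = reader->RowGroup(rg);
    auto pt = std::static_pointer_cast<parquet::FloatReader>(row_group->Column(0));
    float values[1024];
    int64_t values_read = 0;
    while (pt->HasNext()) {
      pt->ReadBatch(1024, nullptr, nullptr, values, &values_read);
      // ... the trivial per-event math (four-vector arithmetic on groups of
      // floats, filling distributions, etc.) would go here ...
    }
  }
}

int main() {
  // Open once, then fan out -- this is exactly the part whose thread-safety
  // I'm asking about; "events.parquet" is a stand-in path.
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile("events.parquet");
  const int kThreads = 4;
  std::vector<std::thread> pool;
  for (int t = 0; t < kThreads; ++t) pool.emplace_back(Worker, reader.get(), t, kThreads);
  for (auto& th : pool) th.join();
  return 0;
}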
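And for question 3, the sort of thing I'd want to plug in. I haven't pinned down the exact virtuals on RandomAccessSource yet, so the stand-in interface below just has the shape I'd expect (size/positioned read/close) and the real base class would take its place; the in-memory buffer fakes what would actually be remote reads against our mass storage:

#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Stand-in for parquet's RandomAccessSource; the real virtual method set is
// almost certainly different, this is just the shape I'd expect.
class StandInRandomAccessSource {
 public:
  virtual ~StandInRandomAccessSource() = default;
  virtual int64_t Size() const = 0;
  virtual int64_t ReadAt(int64_t position, int64_t nbytes, uint8_t* out) = 0;
  virtual void Close() = 0;
};

// Sketch of a source backed by our mass-storage system. Here the data is a
// local buffer; the real version would issue ranged requests with whatever
// prefetch/readahead policy suits our storage.
class MassStorageSource : public StandInRandomAccessSource {
 public:
  explicit MassStorageSource(std::string path)
      : path_(std::move(path)), data_(1024, 0) {}  // pretend open + stat

  int64_t Size() const override { return static_cast<int64_t>(data_.size()); }

  int64_t ReadAt(int64_t position, int64_t nbytes, uint8_t* out) override {
    // Real implementation: translate into a remote read against mass storage
    // instead of a local memcpy.
    int64_t avail = Size() - position;
    int64_t n = nbytes < avail ? nbytes : avail;
    if (n <= 0) return 0;
    std::memcpy(out, data_.data() + position, static_cast<size_t>(n));
    return n;
  }

  void Close() override { /* release the storage handle */ }

 private:
  std::string path_;
  std::vector<uint8_t> data_;
};

int main() {
  MassStorageSource src("/store/user/somefile");  // hypothetical path
  uint8_t buf[16];
  int64_t n = src.ReadAt(0, sizeof(buf), buf);
  std::cout << "read " << n << " bytes of " << src.Size() << "\n";
  src.Close();
  return 0;
}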
