In the last meetup we briefly discussed the vectorized reader the Drill team has written for Parquet, as Daniel has an interest in building something similar to Facebook's Presto ORC reader, which takes this approach. I had meant to send this to the list to gauge interest from other contributors, so here is the message I sent back with information about our reader.
Hi Dan,

Apologies for the late response, we've been wrapping up a Drill release. The structure of the reader is a bit complicated, so I will try to give a decent overview here. I am more than happy to schedule a hangout for anyone who would like to walk through the code. If anyone else is interested, please message back on this thread; I could prepare something a little more structured and schedule time for a group if there is sufficient interest.

Warning: wall of text ahead.

The record reader itself only has a few methods. The setup() method reads the parquet schema and constructs a corresponding column reader for each entry in the schema. Our vectorized reader does not currently support nested data; in the meantime we have implemented a secondary reader that uses the higher level parquet interfaces to read files with complex schemas. Once the column readers have been set up, successive calls to the next() method request additional records to be populated into our in-memory vectorized format, Value Vectors. The main goal of the reader is to make vectorized copies between parquet and the Value Vectors as often as possible.

Value Vectors have some basic functionality declared in a generic interface and base class, but much of their implementation is defined in generated code. This allows the accessors for particular values in a column to be type safe without relying on generics, or even any objects to wrap the values. This has given us great performance, but it has created a large code generation burden throughout Drill to produce code that makes use of all of these similar but slightly different interfaces on each specific vector type.

One design consideration that differentiates our vectorization approach from Hive's is the underlying buffer that holds data within the Value Vectors. We use the Java Unsafe API and Netty's ByteBuf class, which wraps the Unsafe buffers, to store all of the data being processed by Drill. A similar approach is used in the parquet library with byte arrays, but we wanted more control over the allocation and deallocation of memory (along with a number of other performance benefits). The recent GSoC effort by Sunyu converted the read path of parquet to use ByteBuffers so that we can take advantage of some of these performance improvements in the parquet-mr code.

As the underlying storage is the same for all of the data types, in cases where the layout of the data and the format of each value are the same in parquet as they are in Drill, we make a vectorized copy. This is currently the case for int, float, double, and long when they are non-nullable. For nullable data, our layout is slightly different from parquet's, requiring a small modification to this strategy. The Value Vectors are designed to allow random access to values, which reduces the run time of several operations like sort. To allow random access, we actually waste space in our data arrays where nulls appear, so that all of the defined values sit at known indices. This is the opposite of the layout of parquet, which tries to compact everything as much as possible. We have optimized this process to look for runs of defined values (by looking at the definition levels) that can be copied into our vectors all at once. We then count up the number of nulls that appear, leave those spaces in the vector blank, and set the corresponding indices in our nullability vector. A rough sketch of this run-based copy appears below.
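To make that concrete, here is a purely illustrative sketch (not our actual code; the names are made up and plain Java arrays stand in for the ByteBuf-backed vectors) of the run detection and bulk copy for a nullable fixed-width column:

class NullableCopySketch {
    // Illustrative only: arrays stand in for the ByteBuf-backed Value Vectors.
    // definitionLevels comes from parquet; for a flat schema a level of 0 means null.
    static void copyNullableColumn(int[] definitionLevels, long[] packedParquetValues,
                                   long[] valueVector, boolean[] nullabilityVector,
                                   int recordCount) {
        int src = 0;                              // index into the packed parquet values
        int i = 0;
        while (i < recordCount) {
            if (definitionLevels[i] == 0) {       // null: leave a gap in valueVector
                nullabilityVector[i] = false;
                i++;
                continue;
            }
            int runStart = i;                     // start of a run of defined values
            while (i < recordCount && definitionLevels[i] != 0) {
                nullabilityVector[i] = true;
                i++;
            }
            int runLength = i - runStart;
            // copy the whole run in one shot rather than value by value
            System.arraycopy(packedParquetValues, src, valueVector, runStart, runLength);
            src += runLength;
        }
    }
}

The real reader does this against the underlying ByteBuf memory rather than arrays, but the run-at-a-time copy is the important idea.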
For variable length data, we actually need to do a value by value copy for all files written in the parquet v1 format. This is due to the encoding chosen for variable length data. The original design was to encode variable length data as a series of lengths and values intermixed. As this is on the general list, here is a quick explanation for anyone not familiar. Let's say we want to store these strings in one column of a parquet file: barf, lonestar, yogurt (Spaceballs anyone?). In the page that stores them, they would appear like this:

4 barf 8 lonestar 6 yogurt

The lengths are not actually stored in ASCII, but instead as a 4 byte int; this just demonstrates the layout. Each variable length value is preceded directly by its length in the file. This is a reasonable enough approach when materializing into a row major format, where individual values must be copied into each record, but it doesn't work well for Drill. As we still want random access, we cannot spread the lengths around at arbitrary indices in the data. This requires us to make individual copies into both the vector that stores the lengths of the variable length data and the actual data vector itself. This format has been updated in parquet v2 to separate the lengths from the values; we still need to write a new column reader that takes advantage of this. A sketch of the value by value copy the v1 layout forces on us follows below.
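Here is an illustrative sketch of that value by value copy (again, made up names, with java.nio.ByteBuffer standing in for the page and data buffers we actually use):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class VarLenPageSketch {
    // Walk a v1 data page where each value is prefixed by a 4 byte length,
    // copying each value into a data buffer and recording its starting offset.
    // offsets must have room for valueCount + 1 entries.
    static void readPage(ByteBuffer page, int valueCount,
                         int[] offsets, ByteBuffer dataVector) {
        page.order(ByteOrder.LITTLE_ENDIAN);
        int runningOffset = 0;
        for (int i = 0; i < valueCount; i++) {
            int len = page.getInt();           // e.g. 4 for "barf", 8 for "lonestar"
            offsets[i] = runningOffset;        // where this value starts in dataVector
            int savedLimit = page.limit();
            page.limit(page.position() + len); // expose only this value's bytes
            dataVector.put(page);              // copy one value; no bulk copy possible
            page.limit(savedLimit);
            runningOffset += len;
        }
        offsets[valueCount] = runningOffset;   // final offset marks the total length
    }
}

With the v2 encoding, where the lengths are stored separately from the values, the value bytes could instead be copied in large chunks, much like the fixed-width case above.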
These are the main design considerations of the individual column readers. There are a few cases, such as decimals and dates, where we have to convert between the parquet format and our own for each value. This again requires a value by value copy to do the translation. It is not currently something we are concerned about, but these types are not required to be stored in type safe vectors, so it would be possible to read them into vectors that carry their simple type (int, long, fixed binary), and any system designed to handle the different format could still take advantage of the vectorized reads described above. A good example would be using a UDF that understands the parquet value format for dates: you could run a simple filter predicate against the raw data after doing a vectorized copy and then only do the translation on a subset of the data. This would likely require another UDF to convert between the formats. This is an approach we discussed, but for now we decided to just work it into the reader and materialize the values on read.

Overall, there is a lot of complexity in the code; some of it is unnecessary and could use some refactoring. Anyone who would be interested in helping with this effort, or in trying to make our code useful outside of Drill, please message back.

-Jason Altekruse

On Mon, Aug 25, 2014 at 8:58 PM, Jacques Nadeau <[email protected]> wrote:
> The current master version is a good place to start. We're working on
> integrating a new version that will replace this at the moment but you can
> see the current version here:
>
> https://github.com/apache/incubator-drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet
>
> I've added Parth and Jason and they are the primary people working on this.
>
> Jason, can you give them a quick overview of the challenges and approach?
>
> thx,
> Jacques
>
> On Fri, Aug 22, 2014 at 2:00 PM, Daniel Weeks <[email protected]> wrote:
>
>> Jacques,
>>
>> At the parquet sync-up you mentioned that someone has a fork of the drill
>> repo with an implementation of vectorized parquet reading.
>>
>> Could you point us to that repo? We're looking into a vectorized
>> implementation for presto and are interested if there is anything we could
>> learn from the drill approach.
>>
>> Thanks,
>> Dan
