In the last meetup we briefly discussed the vectorized reader the Drill team has written for Parquet, as Daniel has an interest in building something similar to Facebook's Presto ORC reader, which takes this approach. I had meant to send this to the list to gauge interest from other contributors, so here is the message I sent back with information about our reader.
Hi Dan,

Apologies for the late response, we've been wrapping up a Drill release. The structure of the reader is a bit complicated, so I will try to give a decent overview here. I am more than happy to schedule a hangout for anyone who would like to walk through the code. If anyone else is interested, please message back on this thread; I could prepare something a little more structured and schedule time for a group if there is sufficient interest.

Warning: wall of text ahead.

The record reader itself only has a few methods. The setup() method reads the parquet schema and constructs a corresponding column reader for each entry in the schema. Our vectorized reader does not currently support nested data; in the meantime we have implemented a secondary reader that uses the higher level parquet interfaces to read files with complex schemas. Once the column readers have been set up, successive calls to the next() method request additional records to be populated into our in-memory vectorized format, Value Vectors. The main goal of the reader is to make vectorized copies between parquet and the Value Vectors as often as possible.

Value Vectors have some basic functionality declared in a generic interface and base class, but much of their implementation is defined in generated code. This allows the accessors for particular values in a column to be type safe without relying on generics, or even any objects to wrap the values. This has given us great performance, but it has created a large code generation burden throughout Drill to produce code that makes use of all of these similar but slightly different interfaces on each specific vector type.

One design consideration that differentiates our vectorization approach from Hive's is the underlying buffer that holds data within the Value Vectors. We use the Java Unsafe API and Netty's ByteBuf class, which wraps the Unsafe buffers, to store all of the data being processed by Drill. A similar approach is used in the parquet library with byte arrays, but we wanted more control over the allocation and deallocation of memory (along with a number of other performance benefits). The recent GSoC effort by Sunyu converted the read path of parquet to use ByteBuffers so that we can take advantage of some of these performance improvements in the parquet-mr code.

As the underlying storage is the same for all of the data types, in cases where the layout of the data and the format of each value are the same in parquet as they are in Drill, we make a vectorized copy. This is currently the case for int, float, double, and long when they are non-nullable. For nullable data, our layout is slightly different from parquet's, requiring a small modification to this strategy. The Value Vectors are designed to allow random access to values, which reduces the run time of several operations like sort. To allow random access, we actually waste space in our data arrays where nulls appear, so that all of the defined values sit at known indices. This is the opposite of the layout of parquet, which tries to compact everything as much as possible. We have optimized this process to look for runs of defined values (by looking at the definition levels) that can be copied into our vectors all at once. We then count up the number of nulls that appear, leave those spaces in the vector blank, and set the corresponding indices in our nullability vector. A rough sketch of this run-based copy appears below.
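To make that concrete, here is a purely illustrative sketch (not our actual code; the names are made up and plain Java arrays stand in for the ByteBuf-backed vectors) of the run detection and bulk copy for a nullable fixed-width column:

class NullableCopySketch {
    // Illustrative only: arrays stand in for the ByteBuf-backed Value Vectors.
    // definitionLevels comes from parquet; for a flat schema a level of 0 means null.
    static void copyNullableColumn(int[] definitionLevels, long[] packedParquetValues,
                                   long[] valueVector, boolean[] nullabilityVector,
                                   int recordCount) {
        int src = 0;                              // index into the packed parquet values
        int i = 0;
        while (i < recordCount) {
            if (definitionLevels[i] == 0) {       // null: leave a gap in valueVector
                nullabilityVector[i] = false;
                i++;
                continue;
            }
            int runStart = i;                     // start of a run of defined values
            while (i < recordCount && definitionLevels[i] != 0) {
                nullabilityVector[i] = true;
                i++;
            }
            int runLength = i - runStart;
            // copy the whole run in one shot rather than value by value
            System.arraycopy(packedParquetValues, src, valueVector, runStart, runLength);
            src += runLength;
        }
    }
}

The real reader does this against the underlying ByteBuf memory rather than arrays, but the run-at-a-time copy is the important idea.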
For variable length data, we actually need to do a value by value copy for all files written in the parquet v1 format. This is due to the encoding chosen for variable length data. The original design was to encode variable length data as a series of lengths and values intermixed. As this is on the general list, here is a quick explanation for anyone not familiar. Let's say we want to store these strings in one column of a parquet file: barf, lonestar, yogurt (Spaceballs anyone?). In the page that stores them, they would appear like this:

4 barf 8 lonestar 6 yogurt

The lengths are not actually stored in ASCII, but instead as a 4 byte int; this just demonstrates the layout. Each variable length value is preceded directly by its length in the file. This is a reasonable enough approach when materializing into a row major format, where individual values must be copied into each record, but it doesn't work well for Drill. As we still want random access, we cannot spread the lengths around at arbitrary indices in the data. This requires us to make individual copies into both the vector that stores the lengths of the variable length data and the actual data vector itself. This format has been updated in parquet v2 to separate the lengths from the values; we still need to write a new column reader that takes advantage of this. A sketch of the value by value copy the v1 layout forces on us follows below.
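Here is an illustrative sketch of that value by value copy (again, made up names, with java.nio.ByteBuffer standing in for the page and data buffers we actually use):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class VarLenPageSketch {
    // Walk a v1 data page where each value is prefixed by a 4 byte length,
    // copying each value into a data buffer and recording its starting offset.
    // offsets must have room for valueCount + 1 entries.
    static void readPage(ByteBuffer page, int valueCount,
                         int[] offsets, ByteBuffer dataVector) {
        page.order(ByteOrder.LITTLE_ENDIAN);
        int runningOffset = 0;
        for (int i = 0; i < valueCount; i++) {
            int len = page.getInt();           // e.g. 4 for "barf", 8 for "lonestar"
            offsets[i] = runningOffset;        // where this value starts in dataVector
            int savedLimit = page.limit();
            page.limit(page.position() + len); // expose only this value's bytes
            dataVector.put(page);              // copy one value; no bulk copy possible
            page.limit(savedLimit);
            runningOffset += len;
        }
        offsets[valueCount] = runningOffset;   // final offset marks the total length
    }
}

With the v2 encoding, where the lengths are stored separately from the values, the value bytes could instead be copied in large chunks, much like the fixed-width case above.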
These are the main design considerations of the individual column readers. There are a few cases, such as decimals and dates, where we have to convert between the parquet format and our own for each value. This again requires a value by value copy to do the translation. It is not currently something we are concerned about, but these types are not required to be stored in type safe vectors, so it would be possible to read them into vectors that carry their simple type (int, long, fixed binary), and any system designed to handle the different format could still take advantage of the vectorized reads described above. A good example would be using a UDF that understands the parquet value format for dates: you could run a simple filter predicate against the raw data after doing a vectorized copy and then only do the translation on a subset of the data. This would likely require another UDF to convert between the formats. This is an approach we discussed, but for now we decided to just work it into the reader and materialize the values on read.

Overall, there is a lot of complexity in the code; some of it is unnecessary and could use some refactoring. Anyone who would be interested in helping with this effort, or in trying to make our code useful outside of Drill, please message back.

-Jason Altekruse

On Mon, Aug 25, 2014 at 8:58 PM, Jacques Nadeau <[email protected]> wrote:
> The current master version is a good place to start. We're working on
> integrating a new version that will replace this at the moment but you can
> see the current version here:
>
> https://github.com/apache/incubator-drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet
>
> I've added Parth and Jason and they are the primary people working on this.
>
> Jason, can you give them a quick overview of the challenges and approach?
>
> thx,
> Jacques
>
> On Fri, Aug 22, 2014 at 2:00 PM, Daniel Weeks <[email protected]> wrote:
>
>> Jacques,
>>
>> At the parquet sync-up you mentioned that someone has a fork of the drill
>> repo with an implementation of vectorized parquet reading.
>>
>> Could you point us to that repo? We're looking into a vectorized
>> implementation for presto and are interested if there is anything we could
>> learn from the drill approach.
>>
>> Thanks,
>> Dan
