High performance vectorized reader meeting notes

Jason Altekruse Mon, 06 Oct 2014 22:52:53 -0700

Hello Parquet team,

I wanted to report the results of a discussion between the Drill team and
the engineers at Netflix working to make Parquet run faster with Presto.
As we have said in the last few hangouts, we both want to make contributions
back to parquet-mr to add features and improve performance. We thought it would be
good to sit down and speak directly about our real goals and the best next
steps to get an engineering effort started to accomplish these goals.


Below is a summary of the meeting.

- Meeting notes

   - Attendees:

       - Netflix : Eva Tse, Daniel Weeks, Zhenxiao Luo

       - MapR (Drill Team) : Jacques Nadeau, Jason Altekruse, Parth Chandra

- Minutes

   - Introductions/ Background

   - Netflix

       - Working on providing interactive SQL querying to users

       - have chosen Presto as the query engine and Parquet as the
         high-performance data storage format

       - Presto provides the needed speed in some cases, but in others it is
         missing optimizations that could avoid reads

       - Have already started some development and investigation, have
identified key goals

       - Some initial benchmarks with a modified ORC reader (DWRF), written by
         the Presto team, show that such gains are possible with a different
         reader implementation

       - goals

           - filter pushdown

               - skipping reads based on filter evaluation on one or more
columns

               - this can happen at several granularities : row group,
page, record/value

           - late/lazy materialization

               - for columns not involved in a filter, avoid materializing them
                 entirely until they are known to be needed after evaluating a
                 filter on other columns

   - Drill

       - the Drill engine uses an in-memory vectorized representation of
records

       - for scalar and repeated types we have implemented a fast
vectorized reader

         that is optimized to transform between Parquet's on disk and our
in-memory format

       - this currently produces performant table scans, but has no facility
         for filter pushdown

       - Major goals going forward

           - filter pushdown

               - decide the best implementation for incorporating filter
pushdown into

                 our current implementation, or figure out a way to
leverage existing

                 work in the parquet-mr library to accomplish this goal

           - late/lazy materialization

               - see above

           - contribute existing code back to parquet

               - the Drill Parquet reader has a very strong emphasis on
                 performance; with a clear interface to consume that is
                 sufficiently separated from Drill, it could prove very useful
                 for other projects

   - First steps

       - Netflix team will share some of their thoughts and research from
working with

         the DWRF code

           - we can have a discussion based on this: which aspects are done
             well, and any opportunities they may have missed that we can
             incorporate into our design

           - do further investigation and ask the existing community for
guidance on existing

             parquet-mr features or planned APIs that may provide desired
functionality

       - We will begin a discussion of an API for the new functionality

           - some outstanding thoughts for down the road

               - The Drill team has an interest in very late
materialization for data stored

                 in dictionary encoded pages, such as running a join or
filter on the dictionary

                 and then going back to the reader to grab all of the
values in the data that match

                 the needed members of the dictionary

                   - this is a later consideration, but it is part of the reason we are
                     opening up the design discussion early, so that the API can be
                     flexible enough to allow this in the future, even if it is not
                     implemented soon
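
To make the two shared goals above concrete, here is a rough sketch of how
row-group-level filter pushdown and late materialization could fit together.
It is only an illustration under stated assumptions: RowGroup, stats,
read_column and scan_greater_than are hypothetical names, not parquet-mr,
Drill or Presto APIs, and the min/max statistics are computed on the fly
rather than read from file metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group: column name -> decoded values."""
    columns: Dict[str, List[int]]
    reads: int = 0  # number of column reads performed, for illustration only

    def stats(self, name: str) -> Tuple[int, int]:
        # Assumption: a real file would store min/max in its metadata, so
        # looking at statistics does not count as reading the column.
        col = self.columns[name]
        return min(col), max(col)

    def read_column(self, name: str, rows: Optional[List[int]] = None) -> List[int]:
        # Decode a whole column, or only the requested row positions.
        self.reads += 1
        col = self.columns[name]
        return list(col) if rows is None else [col[i] for i in rows]


def scan_greater_than(groups: List[RowGroup], filter_col: str,
                      threshold: int, project_col: str) -> List[int]:
    """Return project_col values for rows where filter_col > threshold."""
    out: List[int] = []
    for group in groups:
        _, hi = group.stats(filter_col)
        if hi <= threshold:
            # Row-group granularity pushdown: no value can match, so
            # neither column is read for this group.
            continue
        # Evaluate the filter on the filter column alone.
        values = group.read_column(filter_col)
        matching = [i for i, v in enumerate(values) if v > threshold]
        # Late materialization: the projected column is decoded only now,
        # and only at the positions that passed the filter.
        out.extend(group.read_column(project_col, matching))
    return out


groups = [
    RowGroup({"id": [1, 2, 3], "price": [10, 20, 30]}),
    RowGroup({"id": [7, 8, 9], "price": [70, 80, 90]}),
]
result = scan_greater_than(groups, "id", 7, "price")
print(result, [g.reads for g in groups])
```

In this toy data the first group is skipped on statistics alone, and in the
second group the price column is decoded only for the two matching rows. The
same shape could in principle be applied at page or record granularity, which
is one of the things an API discussion would need to settle.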
