Hello Parquet team,
I wanted to report the results of a discussion between the Drill team and
the engineers at Netflix working to make Parquet run faster with Presto.
As we have said in the last few hangouts, we both want to make contributions
back to parquet-mr to add features and improve performance. We thought it would be
good to sit down and speak directly about our real goals and the best next
steps to get an engineering effort started to accomplish these goals.
Below is a summary of the meeting.
- Meeting notes
- Attendees:
- Netflix : Eva Tse, Daniel Weeks, Zhenxiao Luo
- MapR (Drill Team) : Jacques Nadeau, Jason Altekruse, Parth Chandra
- Minutes
- Introductions / Background
- Netflix
- Working on providing interactive SQL querying to users
- Have chosen Presto as the query engine and Parquet as the
  high-performance data storage format
- Presto provides the needed speed in some cases, but in others it is
  missing optimizations that could avoid reads
- Have already started some development and investigation, and have
  identified key goals
- Some initial benchmarks with DWRF, a modified ORC reader written by
  the Presto team, show that such gains are possible with a different
  reader implementation
- goals
- filter pushdown
- skipping reads based on filter evaluation on one or more columns
- this can happen at several granularities: row group, page,
  record/value
- late/lazy materialization
- for columns not involved in a filter, avoid materializing them
  entirely until they are known to be needed after evaluating a filter
  on other columns
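To make the two goals above concrete, here is a minimal sketch in Python. It is not parquet-mr, Presto, or Drill code: `RowGroup`, `stats`, `read_column`, and `scan_equal` are hypothetical names, and the row groups are plain in-memory lists. It assumes min/max statistics are available per row group and shows an equality filter that first skips whole row groups and then decodes the projected columns only when some row actually matched.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group (column name -> values)."""
    columns: Dict[str, List[int]]
    reads: List[str] = field(default_factory=list)  # columns actually decoded

    def stats(self, name: str) -> Tuple[int, int]:
        # Assumption: a real reader gets min/max from metadata, without
        # decoding any pages. Here we simply compute them.
        values = self.columns[name]
        return min(values), max(values)

    def read_column(self, name: str) -> List[int]:
        self.reads.append(name)
        return self.columns[name]


def scan_equal(row_groups, filter_col, target, project_cols):
    """Return the projected columns of rows where filter_col == target."""
    out = []
    for rg in row_groups:
        lo, hi = rg.stats(filter_col)
        if target < lo or target > hi:
            continue  # row-group granularity: nothing here can match
        filter_values = rg.read_column(filter_col)
        matches = [i for i, v in enumerate(filter_values) if v == target]
        if not matches:
            continue  # statistics were inconclusive, but no row matched
        # Late materialization: other columns are decoded only now.
        decoded = {
            c: filter_values if c == filter_col else rg.read_column(c)
            for c in project_cols
        }
        for i in matches:
            out.append({c: decoded[c][i] for c in project_cols})
    return out


groups = [
    RowGroup({"id": [1, 2, 3], "price": [10, 20, 30]}),  # skipped by stats
    RowGroup({"id": [4, 6, 8], "price": [40, 60, 80]}),  # filter column only
    RowGroup({"id": [5, 7, 5], "price": [50, 70, 55]}),  # both columns read
]
rows = scan_equal(groups, "id", 5, ["price"])
```

The `reads` list on each group records which columns were decoded, which makes it easy to check that the first group is never touched and the second never decodes `price`. Page-level and record-level skipping would follow the same pattern at a finer granularity.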
- Drill
- the Drill engine uses an in-memory vectorized representation of
  records
- for scalar and repeated types we have implemented a fast vectorized
  reader that is optimized to transform between Parquet's on-disk
  format and our in-memory format
- this currently produces performant table scans, but has no facility
  for filter pushdown
- Major goals going forward
- filter pushdown
- decide the best implementation for incorporating filter pushdown
  into our current implementation, or figure out a way to leverage
  existing work in the parquet-mr library to accomplish this goal
- late/lazy materialization
- see above
- contribute existing code back to parquet
- the Drill Parquet reader has a very strong emphasis on performance;
  a clear interface for consuming it, sufficiently separated from
  Drill, could prove very useful for other projects
- First steps
- Netflix team will share some of their thoughts and research from
  working with the DWRF code
- we can then discuss which aspects are done well, and any
  opportunities they may have missed that we can incorporate into our
  design
- do further investigation and ask the existing community for guidance
  on existing parquet-mr features or planned APIs that may provide the
  desired functionality
- We will begin a discussion of an API for the new functionality
- some outstanding thoughts for down the road
- The Drill team has an interest in very late materialization for data
  stored in dictionary-encoded pages, such as running a join or filter
  on the dictionary and then going back to the reader to grab all of
  the values in the data that match the needed members of the
  dictionary
- this is a later consideration, but it is part of the reason we are
  opening up the design discussion early, so that the API can be
  flexible enough to allow this in the future, even if it is not
  implemented soon
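
As a rough illustration of the dictionary idea in the last item, here is a small Python sketch. It is not an existing Drill or parquet-mr API: `filter_dictionary_page` is a hypothetical name, and the page is modeled as a list of distinct values plus a list of per-row indices into it. The predicate runs once per dictionary entry, and only rows whose index points at a matching entry are materialized.

```python
from typing import Callable, List, Sequence, Tuple


def filter_dictionary_page(
    dictionary: Sequence[str],
    indices: Sequence[int],
    predicate: Callable[[str], bool],
) -> List[Tuple[int, str]]:
    """Return (row position, value) pairs for rows whose value matches."""
    # Evaluate the predicate once per distinct value, not once per row.
    matching_ids = {i for i, v in enumerate(dictionary) if predicate(v)}
    if not matching_ids:
        return []  # no dictionary entry matches, so the page can be skipped
    # Go back to the encoded data and materialize only the matching rows.
    return [
        (row, dictionary[idx])
        for row, idx in enumerate(indices)
        if idx in matching_ids
    ]


# Hypothetical page: three distinct values, five rows.
dictionary = ["DE", "FR", "US"]
indices = [2, 0, 2, 1, 0]

calls = []

def wanted(value: str) -> bool:
    calls.append(value)  # record how often the predicate is evaluated
    return value in {"FR", "US"}

selected = filter_dictionary_page(dictionary, indices, wanted)
```

Here the predicate is evaluated three times (once per dictionary entry) for five rows; the saving grows with the ratio of rows to distinct values. A join against the dictionary could reuse the same `matching_ids` step.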