Drill reading links

Jason Frantz Sat, 25 Aug 2012 16:37:18 -0700

Hi everyone,

Before sending out an architecture doc, I wanted to send out a set of links
to systems or research that have been influencing our design. Google's
Dremel paper [1] does a good job of summarizing the use case of fast
analytics, but says little about the actual system structure. In addition,
we'd like to support some data models and execution patterns outside of
what's mentioned in that paper.


The overall picture can be very roughly broken down into three overlapping
components. The first is the query language and data model exposed to the
user. Our inspirations here are:
- SQL
- BigQuery [2], which has a SQL-like language wrapped around a protocol
buffer data model [3]
- MongoDB, which has a JSON-derived data model

The second component is the execution engine. The basic model is that each
query is a data flow program structured as a DAG of execution nodes, as
expressed in Microsoft's Dryad paper [4]. Each node in the DAG is an
operator that may be run across many machines. For examples of operators,
see SQL Server [5].
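
As a rough illustration of the "query as a DAG of operators" idea, here is a
minimal single-process sketch. The operator names (scan, filter_large,
sum_by_user), the sample rows, and the run_dag helper are all hypothetical;
nothing here is taken from Dryad or from an actual Drill design, and a real
engine would run each operator across many machines rather than in one loop.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical operators: each takes a list of upstream outputs.
def scan(_inputs):
    # Source node with made-up sample rows; it has no upstream inputs.
    return [
        {"user": "a", "bytes": 10},
        {"user": "b", "bytes": 5},
        {"user": "a", "bytes": 7},
    ]


def filter_large(inputs):
    # Keep only rows whose "bytes" value exceeds 6.
    return [row for row in inputs[0] if row["bytes"] > 6]


def sum_by_user(inputs):
    # Aggregate total bytes per user.
    totals = {}
    for row in inputs[0]:
        totals[row["user"]] = totals.get(row["user"], 0) + row["bytes"]
    return totals


def run_dag(operators, upstream):
    """Run every operator after all of its upstream nodes have finished."""
    results = {}
    for node in TopologicalSorter(upstream).static_order():
        inputs = [results[dep] for dep in upstream.get(node, ())]
        results[node] = operators[node](inputs)
    return results


OPERATORS = {"scan": scan, "filter": filter_large, "aggregate": sum_by_user}
# Maps each node to the nodes it reads from.
UPSTREAM = {"scan": [], "filter": ["scan"], "aggregate": ["filter"]}

if __name__ == "__main__":
    print(run_dag(OPERATORS, UPSTREAM)["aggregate"])
```

The point of the sketch is only that the query plan is data (a graph), so the
scheduler can decide where and in what order each operator runs.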

The third component is the storage format. There are several distinct types
of formats we want to support:
- Row-based w/o schema, e.g. JSON, CSV
- Row-based w/ schema, e.g. traditional SQL, protobufs
- Columnar-based w/ schema, e.g. columnar databases [6], Dremel, RCFile
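
To make the three categories concrete, the sketch below holds the same two
records in each layout. The field names, the SCHEMA tuple, and both helper
functions are made up for illustration; this is not the on-disk format of any
of the systems listed above.

```python
import json

# Row-based without schema: every JSON line carries its own field names,
# and the set of fields may differ from one row to the next.
JSON_LINES = [
    '{"user": "a", "bytes": 10}',
    '{"user": "b", "bytes": 5, "agent": "curl"}',
]

# Hypothetical fixed schema shared by all rows.
SCHEMA = ("user", "bytes", "agent")


def to_schema_rows(json_lines, schema):
    """Row-based with schema: one tuple per record, None for missing fields."""
    return [
        tuple(json.loads(line).get(name) for name in schema)
        for line in json_lines
    ]


def to_columns(rows, schema):
    """Columnar with schema: one list per field, aligned by record position."""
    return {name: [row[i] for row in rows] for i, name in enumerate(schema)}


if __name__ == "__main__":
    rows = to_schema_rows(JSON_LINES, SCHEMA)
    print(rows)
    print(to_columns(rows, SCHEMA))
```

A query that only touches "bytes" reads a single list in the columnar layout,
whereas the row layouts have to walk every record.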

Rather than relying on the user carefully creating a series of prebuilt
indexes for anything they want to query, we'd like to rely on in-situ
processing whenever possible. This includes adaptive indexing techniques
like "database cracking" [7] as well as the ability to efficiently process
"raw data" [8]. In addition, since we want to support several distinct data
formats, we need to convert between them. One example is converting
between JSON, which doesn't have a consistent "schema" from one row to the
next, and protobufs, which do. Another example is the conversion from
columnar format to row format [9].
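
For readers new to the adaptive indexing idea mentioned above, here is a
heavily simplified sketch of "database cracking": instead of building an
index up front, each range query partitions only the piece of the column it
touches, so the column becomes more ordered as queries arrive. The
CrackedColumn class and its method names are hypothetical, and the partition
step builds new lists rather than swapping in place as the paper [7]
describes.

```python
import bisect


class CrackedColumn:
    """A column copy that is incrementally partitioned by the queries it serves."""

    def __init__(self, values):
        self.values = list(values)  # reordered piece by piece over time
        self.pivots = []            # sorted pivot values seen so far
        self.positions = []         # positions[i]: first index with value >= pivots[i]

    def crack(self, pivot):
        """Ensure values < pivot precede values >= pivot; return the split index."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked on this value
        # Only the piece between the neighbouring pivots has to be touched.
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.positions) else len(self.values)
        piece = self.values[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.values[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return the values v with low <= v < high, cracking on both bounds."""
        start = self.crack(low)
        end = self.crack(high)
        return self.values[start:end]


if __name__ == "__main__":
    col = CrackedColumn([13, 16, 4, 9, 2, 12, 7, 1, 19, 3])
    print(sorted(col.range_query(5, 13)))
    print(col.pivots)
```

A repeated or overlapping query reuses the recorded pivots, so it only
re-partitions the smaller piece between existing split points.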

Please feel free to chime in with other references that the project should
be looking into.

-Jason

[1] http://research.google.com/pubs/pub36632.html
[2] https://developers.google.com/bigquery/docs/query-reference
[3] https://developers.google.com/protocol-buffers/docs/proto
[4] http://research.microsoft.com/en-us/projects/dryad/
[5] http://msdn.microsoft.com/en-us/library/ms191158.aspx
[6] http://db.csail.mit.edu/projects/cstore/
[7] http://pdf.aminer.org/000/094/728/database_cracking.pdf
[8] http://homepages.cwi.nl/~idreos/NoDBsigmod2012.pdf
[9] http://db.csail.mit.edu/projects/cstore/abadiicde2007.pdf
