Does this mean that data is stored in both row and column format ? or is the data stored in one format based on some configuration ?
Cheers, Karthik On 25 August 2012 19:36, Jason Frantz <[email protected]> wrote: > Hi everyone, > > Before sending out an architecture doc, I wanted to send out a set of links > to systems or research that have been influencing our design. Google's > Dremel paper [1] does a good job at summarizing the use case of fast > analytics, but is quite short on the actual system structure. In addition, > we'd like to support some data models and execution patterns outside of > what's mentioned in that paper. > > The overall picture can be very roughly broken down into three overlapping > components. The first is the query language and data model exposed to the > user. Our inspirations here are > - SQL > - BigQuery [2], which has a SQL-like language wrapped around a protocol > buffer data model [3] > - MongoDB, which has a JSON-derived data model > > The second component is the execution engine. The basic model is that each > query is a data flow program structured as a DAG of execution nodes, as > expressed in Microsoft's Dryad paper [4]. Each node in the DAG is an > operator that may be run across many machines. For examples of operators, > see SQL Server [5]. > > The third component is the storage format. There are several distinct types > of formats we want to support: > - Row-based w/o schema, e.g. JSON, CSV > - Row-based w/ schema, e.g. traditional SQL, protobufs > - Columnar-based w/ schema, e.g. columnar databases [6], Dremel, RCFile > > Rather than relying on the user carefully creating a series of prebuilt > indexes for anything they want to query, we'd like to rely on in-situ > processing whenever possible. This includes adaptive indexing techniques > like "database cracking" [7] as well as the ability to efficiently process > "raw data" [8]. In addition, since we want to support several distinct data > formats we need to transfer between those formats. One example is varying > between JSON, which doesn't have a consistent "schema" from one row to the > next, and protobufs, which do. Another example is the conversion from > columnar format to row format [9]. > > Please feel free to chime in with other references that the project should > be looking into. > > -Jason > > [1] http://research.google.com/pubs/pub36632.html > [2] https://developers.google.com/bigquery/docs/query-reference > [3] https://developers.google.com/protocol-buffers/docs/proto > [4] http://research.microsoft.com/en-us/projects/dryad/ > [5] http://msdn.microsoft.com/en-us/library/ms191158.aspx > [6] http://db.csail.mit.edu/projects/cstore/ > [7] http://pdf.aminer.org/000/094/728/database_cracking.pdf > [8] http://homepages.cwi.nl/~idreos/NoDBsigmod2012.pdf > [9] http://db.csail.mit.edu/projects/cstore/abadiicde2007.pdf >
