Yeah, we want the ability to query data stored in multiple formats. For example, if you have a bunch of log files in HDFS, you should be able to query them in place without having to run a load tool. The same principle applies if you want to query a table in Cassandra: a query should be able to process that data without a separate load step.
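As a rough illustration of what "query in place" could look like, here is a minimal Python sketch that filters and aggregates newline-delimited JSON log records straight from a file, with no load step in between. Everything specific in it is an assumption for the example: the field names (`status`, `bytes`), the helpers `scan_json_lines` and `sum_bytes_by_status`, and the local temp file standing in for a file in HDFS. It is not meant to describe the actual design.

```python
import json
import os
import tempfile
from collections import defaultdict


def scan_json_lines(path):
    """Lazily yield one dict per non-empty line of a newline-delimited JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def sum_bytes_by_status(path, min_bytes=0):
    """Roughly: SELECT status, SUM(bytes) ... WHERE bytes >= min_bytes GROUP BY status."""
    totals = defaultdict(int)
    for rec in scan_json_lines(path):
        size = rec.get("bytes", 0)  # rows need not share a schema; treat missing as 0
        if size >= min_bytes:
            totals[rec.get("status")] += size
    return dict(totals)


def demo():
    # Hypothetical log records; the temp file stands in for a log file in HDFS.
    rows = [
        {"status": 200, "bytes": 512},
        {"status": 404, "bytes": 64},
        {"status": 200, "bytes": 1024},
        {"status": 500},  # no "bytes" field at all
    ]
    with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    try:
        # The query reads the raw file directly; nothing is loaded beforehand.
        return sum_bytes_by_status(f.name, min_bytes=100)
    finally:
        os.remove(f.name)


if __name__ == "__main__":
    print(demo())
```

The point of the generator is that the scan, the filter and the aggregate all run in one pass over the raw file, so the only cost paid up front is parsing.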
-Jason

On Sat, Aug 25, 2012 at 5:08 PM, karthik tunga <[email protected]> wrote:

> Does this mean that data is stored in both row and column format ? or is
> the data stored in one format based on some configuration ?
>
> Cheers,
> Karthik
>
> On 25 August 2012 19:36, Jason Frantz <[email protected]> wrote:
>
> > Hi everyone,
> >
> > Before sending out an architecture doc, I wanted to send out a set of
> > links to systems or research that have been influencing our design.
> > Google's Dremel paper [1] does a good job at summarizing the use case of
> > fast analytics, but is quite short on the actual system structure. In
> > addition, we'd like to support some data models and execution patterns
> > outside of what's mentioned in that paper.
> >
> > The overall picture can be very roughly broken down into three
> > overlapping components. The first is the query language and data model
> > exposed to the user. Our inspirations here are
> > - SQL
> > - BigQuery [2], which has a SQL-like language wrapped around a protocol
> >   buffer data model [3]
> > - MongoDB, which has a JSON-derived data model
> >
> > The second component is the execution engine. The basic model is that
> > each query is a data flow program structured as a DAG of execution
> > nodes, as expressed in Microsoft's Dryad paper [4]. Each node in the DAG
> > is an operator that may be run across many machines. For examples of
> > operators, see SQL Server [5].
> >
> > The third component is the storage format. There are several distinct
> > types of formats we want to support:
> > - Row-based w/o schema, e.g. JSON, CSV
> > - Row-based w/ schema, e.g. traditional SQL, protobufs
> > - Columnar-based w/ schema, e.g. columnar databases [6], Dremel, RCFile
> >
> > Rather than relying on the user carefully creating a series of prebuilt
> > indexes for anything they want to query, we'd like to rely on in-situ
> > processing whenever possible. This includes adaptive indexing techniques
> > like "database cracking" [7] as well as the ability to efficiently
> > process "raw data" [8]. In addition, since we want to support several
> > distinct data formats we need to transfer between those formats. One
> > example is varying between JSON, which doesn't have a consistent
> > "schema" from one row to the next, and protobufs, which do. Another
> > example is the conversion from columnar format to row format [9].
> >
> > Please feel free to chime in with other references that the project
> > should be looking into.
> >
> > -Jason
> >
> > [1] http://research.google.com/pubs/pub36632.html
> > [2] https://developers.google.com/bigquery/docs/query-reference
> > [3] https://developers.google.com/protocol-buffers/docs/proto
> > [4] http://research.microsoft.com/en-us/projects/dryad/
> > [5] http://msdn.microsoft.com/en-us/library/ms191158.aspx
> > [6] http://db.csail.mit.edu/projects/cstore/
> > [7] http://pdf.aminer.org/000/094/728/database_cracking.pdf
> > [8] http://homepages.cwi.nl/~idreos/NoDBsigmod2012.pdf
> > [9] http://db.csail.mit.edu/projects/cstore/abadiicde2007.pdf
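
To make the format-conversion point in the quoted message a bit more concrete, here is a small Python sketch of pivoting schema-less rows (JSON-style dicts whose fields vary from row to row) into a columnar layout and back. The function names `rows_to_columns` and `columns_to_rows` and the sample fields are hypothetical, and this is only one simple way to do it, assuming missing fields can be padded with `None`; it is not the technique from [9] or from any particular system.

```python
def rows_to_columns(rows):
    """Pivot a list of dict rows into {field: [values...]}, padding gaps with None."""
    fields = []
    for row in rows:
        for key in row:
            if key not in fields:  # keep first-seen order of field names
                fields.append(key)
    return {name: [row.get(name) for row in rows] for name in fields}


def columns_to_rows(columns):
    """Rebuild row dicts from equal-length columns, dropping the None padding.

    Caveat of this sketch: a real null value in the input cannot be told
    apart from a missing field, so both come back as "absent".
    """
    n = len(next(iter(columns.values()), []))
    return [
        {name: col[i] for name, col in columns.items() if col[i] is not None}
        for i in range(n)
    ]


# Hypothetical rows with no consistent schema from one row to the next.
rows = [
    {"user": "a", "ms": 12},
    {"user": "b"},
    {"ms": 7, "host": "x"},
]
columns = rows_to_columns(rows)
roundtrip = columns_to_rows(columns)
```

Here `columns` has one equal-length list per field (`user`, `ms`, `host`), which is the shape a schema-carrying columnar format would need, and `roundtrip` equals the original `rows`.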
