I can't say much about Dryad, but the Dremel points seem correct to me. I would add that the main idea behind Dremel is its reliance on a single scan.
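
To make the column-projection point (item 1 in the quoted message below) a bit more concrete, here is a minimal Python sketch of how I read it. It is only a toy: the records and field names are made up, the schema is flat (so repetition and definition levels are ignored entirely), and it says nothing about how Dremel actually lays columns out on disk. It just shows an aggregation done in one pass that touches only the two columns the query needs.

```python
from collections import defaultdict

# Hypothetical toy records; the field names are invented for illustration.
records = [
    {"url": "a.com", "country": "us", "bytes": 120},
    {"url": "b.com", "country": "de", "bytes": 300},
    {"url": "c.com", "country": "us", "bytes": 80},
]

def to_columns(rows):
    """Pivot row-oriented records into one list per field (flat schema only)."""
    columns = defaultdict(list)
    for row in rows:
        for field, value in row.items():
            columns[field].append(value)
    return dict(columns)

def sum_by(columns, key_field, value_field):
    """Aggregate in a single pass, reading only the two projected columns."""
    totals = defaultdict(int)
    for key, value in zip(columns[key_field], columns[value_field]):
        totals[key] += value
    return dict(totals)

columns = to_columns(records)
result = sum_by(columns, "country", "bytes")
print(result)  # {'us': 200, 'de': 300}
```

The "url" column is stored but never read by the query, which is the IO saving being discussed.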

On Tue, Aug 28, 2012 at 6:46 AM, Dharm Raj <[email protected]> wrote:

> After going through the Dremel & Dryad papers, here is my understanding --
>
> 1. Columnar storage is chosen so that non-required columns of a record can
> be skipped, and hence less IO.
> 2. All values of a field are kept together to improve retrieval efficiency.
> From this my understanding is that if that particular field is required in
> a query, all values can be fetched in one seek efficiently.
> 3. There is no detail in the paper about how to store values, repetition
> levels & definition levels. As David said, it can be done by having
> separate files for values, repetition levels & definition levels. On top
> of this we need to index records so that we can seek to the right position
> and fetch only the desired values, or read more and discard later.
> 4. I agree on the data locality part with Ted and Camuel. It is desired but
> not mandatory. The Dremel paper states that Dremel has the ability to
> access local data or data in GFS or another store like BigTable.
> 5. Dremel and Dryad both mention a similar way to retrieve data using a
> serving tree; each node acts (independently) as an operator or runs some
> custom code. A user-submitted query is translated to form a DAG of
> execution. Dryad states that relational algebra can be expressed as a DAG.
> General graphs are more complicated to implement and need to take care of
> cycles during execution. Hence Dryad chose a DAG as its query execution
> model.
>
>
> Please share your understanding on this to enhance (correct) mine.
>
> Regards,
> Dharm
>
>
> On Tue, Aug 28, 2012 at 4:40 AM, Camuel Gilyadov <[email protected]> wrote:
>
> > On Mon, Aug 27, 2012 at 8:40 PM, Min Zhou <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I was very excited that you guys decided to start Apache Drill, an
> > > open source version of Google's Dremel. I was a contributor to Apache
> > > Hive, and am skilled in Hadoop-related development. We have a nearly
> > > 3000-node cluster in production, one of the largest clusters in the
> > > world.
> > >
> > > Dremel became more and more popular since Google's BigQuery was
> > > released. I took an interest in this nearly two years ago. This paper
> > > (http://research.google.com/pubs/...<http://research.google.com/pubs/pub36632.html>)
> > > describes how Dremel organizes records into nested columnar data. But
> > > there’s almost no information about how Dremel stores those columns.
> > > I have many questions on this point.
> > >
> > >
> > > 1. Is there one file for each column?
> > >
> >
> > I think it is a less important implementation detail. What is important
> > is that you don't incur IO for non-projected columns.
> >
> > > 2. It seems that Dremel does not restrict where data is stored: local
> > > disk, GFS, or Bigtable could all be the target storage. If the data
> > > is in GFS, how does Dremel retrieve records from different nodes?
> > > How is data locality guaranteed?
> > >
> >
> > Data locality is not mandatory. It is clearly written that data is either
> > local or accessed remotely. Search the Dremel paper or slide deck for
> > "in-situ" and "local".
> >
> > > 3. The paper says that "The blocks in each stripe are prefetched
> > > asynchronously; the read-ahead cache typically achieves hit rates of
> > > 95%." Does GFS support async prefetching?
> > >
> > >
> > > Have you considered the questions above? What are your answers?
> > >
> > > BTW, could I join you guys to start such a cool project?
> > >
> >
> > It is open to everyone.
> >
> > > Thanks,
> > > Min
> > >
> > > --
> > > My research interests are distributed systems, parallel computing and
> > > bytecode based virtual machine.
> > >
> > > My profile:
> > > http://www.linkedin.com/in/coderplay
> > > My blog:
> > > http://coderplay.javaeye.com
> > >
> >
