On Mon, Aug 27, 2012 at 8:40 PM, Min Zhou <[email protected]> wrote:
> Hi all,
>
> I was very excited that you guys decided to start Apache Drill, an open
> source version of Google's Dremel. I was a contributor to Apache Hive and
> am skilled in Hadoop-related development. We have a nearly 3000-node
> cluster in production, one of the largest clusters in the world.
>
> Dremel has become more and more popular since Google's BigQuery was
> released. I took an interest in it nearly two years ago. This paper
> (http://research.google.com/pubs/pub36632.html) describes how Dremel
> organizes records into nested columnar data. But there's almost no
> information about how Dremel stores those columns. I have many questions
> on this point.
>
> 1. Is there one file for each column?

I think that is a less important implementation detail. What is important
is that you don't incur I/O for non-projected columns.

> 2. It seems that Dremel has no restriction on where data must be stored:
>    local disk, GFS, or Bigtable could all be the target storage. If the
>    data is in GFS, how does Dremel retrieve records from different nodes?
>    How is data locality guaranteed?

Data locality is not mandatory. The paper states clearly that data is
either local or accessed remotely. Search the Dremel paper or slide deck
for "in-situ" and "local".

> 3. The paper says that "The blocks in each stripe are prefetched
>    asynchronously; the read-ahead cache typically achieves hit rates of
>    95%." Does GFS support async prefetching?
>
> Have you considered the questions above? What are your answers?
>
> BTW, could I join you guys to start such a cool project?

It is open to everyone.

> Thanks,
> Min
>
> --
> My research interests are distributed systems, parallel computing and
> bytecode-based virtual machines.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
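
To make the point about non-projected columns concrete, here is a minimal
sketch. It assumes a hypothetical one-file-per-column layout with JSON-encoded
values and flat records that all share the same fields; the paper does not
describe Dremel's on-disk format, so none of the names below (write_columns,
read_projection, the ".col" suffix) come from it. The only thing it shows is
that a query projecting one column never opens the files of the others.

```python
import json
import os
import tempfile


def write_columns(directory, records):
    """Write each field of the records to its own file.

    Hypothetical layout: one JSON list per column, in <name>.col.
    Assumes flat records that all have the same fields.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    for name, values in columns.items():
        with open(os.path.join(directory, name + ".col"), "w") as f:
            json.dump(values, f)


def read_projection(directory, projected):
    """Read only the projected columns.

    Returns the reassembled rows and the list of file names that were
    opened, so the caller can see which columns incurred I/O.
    """
    opened = []
    data = {}
    for name in projected:
        file_name = name + ".col"
        opened.append(file_name)
        with open(os.path.join(directory, file_name)) as f:
            data[name] = json.load(f)
    # Stitch the column values back into rows, position by position.
    rows = [dict(zip(projected, values))
            for values in zip(*(data[name] for name in projected))]
    return rows, opened


def demo():
    """Project a single column out of a two-column data set."""
    directory = tempfile.mkdtemp()
    write_columns(directory, [{"url": "a", "size": 1},
                              {"url": "b", "size": 2}])
    return read_projection(directory, ["url"])
```

Calling demo() reads url.col only; size.col stays on disk untouched. Whether
the columns live in separate files or in separate regions of one file is the
implementation detail; the property that matters is that the reader can skip
them.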
