Hi all,

I was very excited that you guys decided to start Apache Drill, an open source version of Google's Dremel. I have contributed to Apache Hive and am skilled in Hadoop-related development. We have a cluster of nearly 3000 nodes in production, one of the largest clusters in the world.

Dremel has become more and more popular since Google's BigQuery was released. I took an interest in it nearly two years ago. This paper (http://research.google.com/pubs/pub36632.html) describes how Dremel organizes records into nested columnar data, but there is almost no information about how Dremel stores those columns. I have several questions on this point.

1. Is there one file for each column?
2. It seems that Dremel places no restriction on where data must be stored: local disk, GFS, or Bigtable could all be the target storage. If the data is in GFS, how does Dremel retrieve records from different nodes? How does it guarantee data locality?
3. The paper says that "The blocks in each stripe are prefetched asynchronously; the read-ahead cache typically achieves hit rates of 95%." Does GFS support asynchronous prefetching?

Have you considered the questions above? What are your answers?

BTW, could I join you guys to start such a cool project?

Thanks,
Min

--
My research interests are distributed systems, parallel computing, and bytecode-based virtual machines.

My profile: http://www.linkedin.com/in/coderplay
My blog: http://coderplay.javaeye.com
