A list of questions on Dremel (or Apache Drill)'s columnar storage

Min Zhou Mon, 27 Aug 2012 10:41:12 -0700

Hi all,

I was very excited to hear that you guys decided to start Apache Drill,
an open source version of Google's Dremel. I was a contributor to Apache
Hive and am skilled in Hadoop-related development. We run a cluster of
nearly 3000 nodes in production, one of the largest clusters in the world.


Dremel has become more and more popular since Google's BigQuery was
released. I took an interest in it nearly two years ago. The paper
(http://research.google.com/pubs/pub36632.html) describes how Dremel
organizes records into nested columnar data, but there is almost no
information about how Dremel stores those columns. I have several
questions on this point.


   1. Is there one file for each column?
   2. It seems that Dremel does not restrict where data is stored: local
   disk, GFS, or Bigtable could all be the target storage. If the data is
   in GFS, how does Dremel retrieve records from different nodes? How does
   it guarantee data locality?
   3. The paper states that "The blocks in each stripe are prefetched
   asynchronously; the read-ahead cache typically achieves hit rates of
   95%." Does GFS support asynchronous prefetching?


Have you considered the questions above? What are your answers?

BTW, could I join you guys in starting such a cool project?


Thanks,
Min

-- 
My research interests are distributed systems, parallel computing, and
bytecode-based virtual machines.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com
