Its address is https://github.com/alibaba/mdrill . I think some of the information or design may be helpful for Apache Drill developers.
mdrill is similar to Apache Drill or Google PowerDrill. It is based on hadoop, lucene, solr and jstorm.

My project currently has 10 tables with 47760506482 rows and 80~400 columns. It runs on 10 machines, each with 48GB of RAM and 12*2TB disks.

Some example queries are shown below:

select count(*) from r_rpt_cps_luna_item where thedate >='20130416' and thedate <'20130811' limit 0,100
_____
totalRecords:1
count(*)
11108914892
times taken 4.031 seconds

select sum(landing_uv) from r_rpt_cps_luna_item where thedate >='20130416' and thedate <'20130811' limit 0,100
_____
totalRecords:1
sum(landing_uv)
2.07678497E8
times taken 56.081 seconds

select dist(user_id) from r_rpt_cps_luna_item where thedate >='20130416' and thedate <'20130811' limit 0,100
_____
totalRecords:1
dist(user_id)
1483008.0
times taken 246.147 seconds

select thedate,count(*) as cnt from r_rpt_cps_luna_item where thedate >='20130416' and thedate <'20130811' group by thedate order by cnt desc limit 0,3
_____
totalRecords:118
thedate   cnt
20130803  158301304
20130802  157748487
20130725  157047045
times taken 34.727 seconds

select thedate,user_id,count(*) as cnt from r_rpt_cps_luna_item where thedate >='20130416' and thedate <'20130811' group by thedate,user_id order by cnt desc limit 0,3
_____
totalRecords:10010
thedate   user_id    cnt
20130725  725677994  194397
20130725  101450072  192650
20130701  101450072  189107
times taken 149.316 seconds

select thedate,category_level1,count(*) as cnt from r_rpt_cps_luna_item where thedate >='20130416' and thedate <'20130811' group by thedate,category_level1 order by cnt desc limit 0,3
_____
totalRecords:10010
thedate   category_level1  cnt
20130803  16               26487658
20130802  16               26306163
20130725  16               26128576
times taken 94.989 seconds

select thedate,category_level1,category_level2,count(*) as cnt from r_rpt_cps_luna_item where thedate >='20130416' and thedate <'20130811' group by thedate,category_level1,category_level2 order by cnt desc limit 0,3
_____
totalRecords:10010
thedate   category_level1  category_level2  cnt
20130725  16               50010850         7315606
20130803  16               50010850         7006255
20130802  16               50010850         6936059
times taken 288.885 seconds

Introduction

1: mdrill aims to help users analyze tens of billions of rows, over any combination of dimensions, within a few seconds to a few tens of seconds.

2: mdrill is a distributed online analytical query system. It is implemented on top of open-source systems such as hadoop, lucene, solr and jstorm, and uses an SQL-based query syntax. mdrill is a software framework for distributed processing of large amounts of data. It is fast: underneath, it uses indexes, columnar storage and an in-memory cache, which greatly increase data scan speed. It is distributed: it works in parallel, and parallel processing speeds up the work.

3: The adhoc project, an application built on mdrill, uses 10 machines and stores 40 billion rows.
==> Each query scans 3 billion rows, with a response time of roughly 20~120 seconds (depending on the query conditions and the number of columns scanned).
==> count(*) over 10 billion rows takes 2 seconds; a single-column sum takes 25 seconds; count and sum grouped by date take 47 seconds; grouping by user id and sorting by number of transactions to get the TopN takes 243 seconds.
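
To make the parallel processing in point 2 more concrete, here is a minimal Python sketch of how a "group by ... order by cnt desc limit N" query could be answered by counting on each shard in parallel and then merging the partial counts. This is only an illustration under my own assumptions, not mdrill's actual code: the function names partial_group_count and distributed_top_n are hypothetical, threads stand in for the machines, and rows are plain dicts rather than an indexed columnar store.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_group_count(rows, key_fields, lo, hi):
    """Count rows per group on ONE shard (hypothetical helper).

    Only rows with lo <= thedate < hi are counted; dates are
    'YYYYMMDD' strings, so string comparison matches date order.
    """
    counts = Counter()
    for row in rows:
        if lo <= row["thedate"] < hi:
            counts[tuple(row[f] for f in key_fields)] += 1
    return counts


def distributed_top_n(shards, key_fields, lo, hi, n):
    """Run the per-shard count in parallel, merge, and return the top n groups.

    Threads stand in for separate machines here (an assumption for
    illustration only).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(
            pool.map(lambda s: partial_group_count(s, key_fields, lo, hi), shards)
        )
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # adds the counts of matching groups
    return merged.most_common(n)


if __name__ == "__main__":
    # Two tiny made-up shards, just to show the call shape.
    shards = [
        [{"thedate": "20130725", "user_id": 1}, {"thedate": "20130801", "user_id": 2}],
        [{"thedate": "20130725", "user_id": 1}, {"thedate": "20130901", "user_id": 3}],
    ]
    print(distributed_top_n(shards, ["thedate", "user_id"], "20130416", "20130811", 3))
```

In this sketch each shard returns its full partial counts, so the merged top N is exact. A real system may cap how many groups each shard returns to bound memory and network cost, which trades some exactness for speed.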