Jagane, I understand your use case, I think, and so here are my thoughts, inline:

>1. Hbase support, i.e. working scale tested Append and Hflush in HDFS

Absolutely. HBase (and other components of the stack that do not follow the MapReduce paradigm) is increasingly important. It is important to realize that as Hadoop gains popularity, people will look at consolidating their workloads, and they are going to need at least baseline features such as append and flush to achieve that.

>2. Built in support for the cloud. (Whirr is interesting. Ambari more so,
>but both fall short.)

Not very sure. If "support for the cloud" means the ability to provision atop a hypervisor, adding or removing instances, etc., I think there are other approaches proven in the industry.

>3. Assumption that 10GBE is around the corner (really, this time), and
>hence storage locality is irrelevant

Yes, I have been shouting from the rooftops about this for quite some time now.

>4. Storage efficiency is important. Alternatives to a 3 replica HDFS, such
>as erasure code, should be first class citizens in this distro.

Absolutely. Usable space is much more important than raw space.

>5. H/A for the NN

Yes, it's a must. Some proprietary file systems that provide the o.a.h.f.FileSystem API have this feature already, and are getting a lot of positive press recently.

>
>Such a distro would be an outstanding thing for the Hadoop community. I
>think 0.20.20x is the closest to this, but I am not sure.

Other than the merge of the 0.20-append patches into 0.20.205, I am not aware of any other changes that address any of your requirements 1-5.

>My hope is that this discussion will get some input from users of Hadoop. I
>may be wrong, as this may be the wrong forum for this discussion. (The only
>thing I really accomplished was to evoke a hurried and semi-infuriated
>Sunday afternoon private email response from some key players in the Hadoop
>community).

Yeah, some key players in the Hadoop community are infuriated on Sunday afternoons, based on my informal sentiment analysis of Twitter streams. ;-)

>My ultimate goal is to influence the product managers at Hadoop startups and
>established companies to assign high priorities to these items.

Believe me, I know some product managers at Hadoop startups and established companies who already have a slide highlighting most of the above.

>In short, I don't own the whip, the buggy, or the horse ... but I am trying
>to crack the whip. :-)

Ha ha! Interesting analogy. But this is the open-source world. Here no one "owns" (or at least, no one is supposed to own) the whip, buggy, or horse. So, you are not alone :-)

>Milind - I do look forward to your input as to the importance of these
>features, and whether these are feasible in one of the source branches in
>the near future.

Indeed, these are feasible. Indeed, these are important, and indeed they will be in one of the source branches in the future. I don't know about the *near* future, though.

- Milind

---
Milind Bhandarkar
Greenplum Labs, EMC
(Disclaimer: Opinions expressed in this email are those of the author, and do not necessarily represent the views of any organization, past or present, the author might be affiliated with.)
