I think a good starting point for that distribution guide would be a feature matrix where all reasonable distributions could be compared.
There could be metrics for cross-cutting concerns like performance, security, etc., with reference to real benchmarks. From this, one could derive (perhaps with additional explanations) which distribution best fits a given use case. Most important, though, is that the comparison is independent and unbiased.

Regards,
Chris

________________________________
From: Keith Wiley <kwi...@keithwiley.com>
To: general@hadoop.apache.org
Sent: 5:45, Friday, 21 September 2012
Subject: Re: Choosing a Hadoop distribution

Thanks, that all seems quite reasonable I suppose.

Cheers!

On Sep 20, 2012, at 11:22, Aaron Eng wrote:

>> I'm tasked with creating a guide that instructs on how to choose a Hadoop
>> distribution from the handful of common options.
>> Does anyone have any thoughts on what criteria might govern such a
>> decision?
>
> What problem(s) are you trying to solve with Hadoop (and related projects)?
> What are your expectations of the technology?
>
> The details beyond that level could take many, many pages to cover.
>
> Not all Hadoop distributions are tested the same way, packaged with the
> same components, etc. Not all components of a given Hadoop distribution
> work with other Hadoop distributions. There are a lot of common things
> between distributions, which is probably why it's difficult to articulate
> how to choose one over another. So when you look at the problem you are
> trying to solve and your expectations of the technology, many things may
> seem relatively equal, and hence you may need to get into some significant
> level of detail to pick something that best solves your problem. In some
> cases it may be very straightforward as to whether a distribution will meet
> your requirements. In other cases, things may look relatively equal across
> the board until you drill down to a point where you find differentiation
> (or maybe you don't find it). But those would be my criteria: articulate
> the problem and expectations, and compare functionality until you find
> differentiation.
>
>
> On Thu, Sep 20, 2012 at 11:06 AM, Keith Wiley <kwi...@keithwiley.com> wrote:
>
>> I'm tasked with creating a guide that instructs on how to choose a Hadoop
>> distribution from the handful of common options. I'm finding this rather
>> perplexing. While some of the vendors offer additional management software
>> (Cloudera Manager is an example), I'm unclear whether those packages could
>> be installed and run regardless of the underlying Hadoop distribution or
>> if they are exclusively compatible with their vendor's distribution (or if
>> there's some crossover). I'm also unclear on any other basis for
>> comparison. For example, HortonWorks originated HCatalog (to the best of my
>> understanding), but that doesn't necessarily mean one needs to use the HW
>> Hadoop dist. to use HCatalog, since it's just a public Apache project anyway
>> at this point. I'm sure similar statements could be made about MapR or
>> Greenplum (although I think Greenplum's Hadoop uses MapR's M5 anyway, so
>> again, the decision-making process in such a case seems baffling).
>>
>> And then there's the option of installing the Apache version directly,
>> always on the table I suppose.
>>
>> Does anyone have any thoughts on what criteria might govern such a
>> decision? I'm not trying to get into an argument about which distribution
>> is best, I'm not even looking for defenses or arguments for one
>> distribution or another, but rather a notion of what the criteria for
>> basing such a decision might be.
>>
>> Thanks.
>>
>> Cheers!

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion? Every week, we're just making God
madder and madder!"
                                           -- Homer Simpson
________________________________________________________________________________