I'm tasked with creating a guide on how to choose a Hadoop distribution from the handful of common options, and I'm finding this rather perplexing. While some of the vendors offer additional management software (Cloudera Manager is an example), I'm unclear whether those packages can be installed and run regardless of the underlying Hadoop distribution, whether they are exclusively compatible with their vendor's distribution, or whether there's some crossover. I'm also unclear on any other basis for comparison. For example, Hortonworks originated HCatalog (to the best of my understanding), but that doesn't necessarily mean one needs to use the Hortonworks Hadoop distribution to use HCatalog, since it's a public Apache project at this point anyway. I'm sure similar statements could be made about MapR or Greenplum (although I think Greenplum's Hadoop uses MapR's M5 anyway, so again, the decision-making process in such a case seems baffling).
And then there's always the option of installing the Apache version directly. Does anyone have any thoughts on what criteria might govern such a decision? I'm not trying to start an argument about which distribution is best, and I'm not looking for defenses of one distribution or another, but rather a notion of what criteria such a decision might be based on.

Thanks.

Cheers!

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley
________________________________________________________________________________