Just recently VMware announced Project Serengeti, an open-source OVA based on Apache Hadoop that bundles HDFS, MapReduce, Pig, and Hive, to name a few. I believe it requires vSphere to use the OVA.

http://serengeti.cloudfoundry.com/

GitHub source:
https://github.com/vmware-serengeti


-nael

On 9/20/12 10:44 PM, Konstantin Boudnik wrote:
I would add a couple more points to your consideration (maybe this is just
me):
   - vendor lock-in:
     - when you pick a piece of software, make sure that you'd be able to
       move over to a different (yet similar) product offering if you need
       to. You are asking about CDH's Cloudera Manager here: I don't think
       it would work with anything else but CDH (I am not working there, so
       I don't know for sure - but it seems like a reasonable assumption).
     - Hortonworks' HDP provides Ambari for cluster management needs; that
       is a completely open-source technology that you can master if needed
       and most likely use with other Hadoop-based stacks (as far as I can
       see).

     - MapR has quite a few proprietary components in their stack, which
       might or might not be beneficial in your particular case: this is
       something you have to decide for yourself.

   - what are the road-maps of the possible distributions? Will they have
     what you need in the future? A case in point is these guys
         http://www.magnatempusgroup.net/blog/2012/09/05/whats-cooking/
     who are seemingly bringing in-memory analytics into their upcoming
     release. You might want to follow the big Hadoop conference next
     month, which is likely to have a number of interesting announcements
     (otherwise, what would be the point of such a conference ;)

These two would be the pivotal points for me. Hope it helps,
   Cos

On Fri, Sep 21, 2012 at 11:17 AM, hadoop wrote:
I have the same question.
Which version, which vendor do we choose?


--
hadoop


On Friday, September 21, 2012 at 2:22 AM, Aaron Eng wrote:

I'm tasked with creating a guide that instructs on how to choose a Hadoop
distribution from the handful of common options.
Does anyone have any thoughts on what criteria might govern such a
decision?

What problem(s) are you trying to solve with Hadoop (and related
projects)? What are your expectations of the technology?

The details beyond that level could take many, many pages to cover. Not
all Hadoop distributions are tested the same way, packaged with the same
components, etc. Not all components of a given Hadoop distribution work
with other Hadoop distributions. There are a lot of commonalities between
distributions, which is probably why it's difficult to articulate how to
choose one over another. So when you look at the problem you are trying
to solve and your expectations of the technology, many things may seem
relatively equal, and hence you may need to get into some significant
level of detail to pick something that best solves your problem. In some
cases it may be very straightforward whether a distribution will meet
your requirements. In other cases, things may look relatively equal
across the board until you drill down to a point where you find
differentiation (or maybe you don't). So those would be my criteria:
articulate the problem and the expectations, and compare functionality
until you find differentiation.
On Thu, Sep 20, 2012 at 11:06 AM, Keith Wiley <kwi...@keithwiley.com> wrote:
I'm tasked with creating a guide that instructs on how to choose a Hadoop
distribution from the handful of common options. I'm finding this rather
perplexing. While some of the vendors offer additional management
software (Cloudera Manager is an example), I'm unclear whether those
packages can be installed and run regardless of the underlying Hadoop
distribution, or if they are exclusively compatible with their vendor's
distribution (or if there's some crossover). I'm also unclear on any
other basis for comparison. For example, Hortonworks originated HCatalog
(to the best of my understanding), but that doesn't necessarily mean one
needs to use the Hortonworks Hadoop distribution to use HCatalog, since
it's just a public Apache project at this point. I'm sure similar
statements could be made about MapR or Greenplum (although I think
Greenplum's Hadoop uses MapR's M5 anyway, so again, the decision-making
process in such a case seems baffling).

And then there's the option of installing the Apache version directly,
which I suppose is always on the table.

Does anyone have any thoughts on what criteria might govern such a
decision? I'm not trying to get into an argument about which distribution
is best. I'm not even looking for defenses or arguments for one
distribution or another, but rather a notion of what the criteria for
basing such a decision might be.

Thanks. Cheers!
________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a
slippery rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley
________________________________________________________________________________
