As an aside, if you ask for the white paper you get a PDF that
exaggerates the limits of Hadoop.
http://info.platform.com/rs/platform/images/Whitepaper_Top5ChallengesforHadoopMapReduceintheEnterprise.pdf
Mostly focusing on a critique of the scheduler -which MR-279 will fix in
Hadoop 0.23- they say
"It is designed to be used by IT departments that
have an army of developers to help fix any issues they en-
counter"
I don't believe this. Cloudera and Hortonworks will do this for a fee
-as will Platform. In most organisations the R&D effort doesn't go into
the Hadoop codebase, it goes into writing the analysis code, which is
why things like Pig and Hive help -they make it easier.
"Their (Clouderas) distribution is based on open source
which is still an unproven large-scale enterprise full stack
solution. There are many shortcomings in the open source
distribution, including the workload management capa-
bilities.
Other open source commercial distributions are
emerging, with IBM and EMC entering the marketplace.
However, all of these offerings are based on open source
code and inevitably inherit the strengths and weaknesses
of that code base and architectural design. "
Ted will point out that MapR's MR engine isn't limited, as will Brisk,
while Arun will view that statement in the past tense. Doug and Tom will
pick up on the word "unproven" too. Which enterprises plan to have
Hadoop clusters bigger than Yahoo or Facebook?
Furthermore, as Platform only puts in their own scheduler, leaving the
filesystem alone, it's a bit weak to critique the architecture of the
open source distro. Not a way to make friends -or get your bug fixes in.
Or indeed, promise better scalability.
"Therefore they cannot meet the enterprise–class requirements for ”big
data” problems as already mentioned."
This is daft. The only thing Platform brings to the table is a scheduler
that works with "legacy" grid workloads and a console to see what's
going on. I don't see that being tangibly more enterprise-class than the
existing JT -which does persist after an outage. With HDFS underneath, a
new scheduler doesn't even remove the filesystem SPOFs, so the only way
to get an HA cluster is to swap in a premium filesystem.
The other thing the marketing blurb gets wrong is its claim that Hadoop
only works with one distributed file system. Not so. You can read in and
out of any filesystem, file:// being a handy one that works with NFS
mount points too.
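As a rough illustration of that (the namenode address and paths below
are invented, not from the paper): the FileSystem API resolves whichever
scheme a URI names, so the same code talks to HDFS or to a local or
NFS-mounted directory via file://; job input and output paths work the
same way.

// Minimal sketch; hostnames and the /mnt/nfs path are made up.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class SchemeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // HDFS, assuming a namenode at this (made-up) address
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
    System.out.println("hdfs:// resolves to " + hdfs.getClass().getName());

    // The local filesystem; an NFS mount point works the same way
    FileSystem local = FileSystem.get(URI.create("file:///"), conf);
    FSDataOutputStream out = local.create(new Path("file:///mnt/nfs/results/demo.txt"));
    out.writeBytes("written through the file:// scheme\n");
    out.close();
  }
}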
Overall, a disappointing white paper, as all it can do to criticise open
source Hadoop is spread fear about the number of developers you need to
maintain it, and the limitations of the Hadoop scheduler versus their
scheduler -that being the only thing that differs between the Platform
product and the full OSS release.
I missed a talk at the local university by a Platform sales rep last
month, though I did get to offend one of the Condor team instead [1], by
pointing out that all grid schedulers contain a major assumption: that
storage access times are constant across your cluster. They are if you
can pay for something like GPFS, but you don't get 50TB of GPFS storage
for $2500, which is what adding 25*2TB SATA drives (roughly $100 apiece)
would cost if you stuck them on your compute nodes; with Hadoop's
default 3x replication that's $7500 for a fully replicated 50TB. That's
why I'm not a fan of grid systems -the costs of storage and networking
aren't taken into account. Then there are the availability issues with
the larger filesystems, which are a topic for another day.
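To ground that locality point (a rough sketch, not Platform's or
Condor's code; the file path and hostnames are invented): a Hadoop
InputSplit reports which hosts hold the data blocks, and the scheduler
uses those hints to run each map task on or near a node that already
has the data, rather than assuming uniform storage access times.

// Rough sketch with invented names: a FileSplit carries the hosts that
// store the block replicas, and the scheduler reads getLocations() to
// try to place the map task on one of them.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LocalityDemo {
  public static void main(String[] args) throws Exception {
    FileSplit split = new FileSplit(
        new Path("hdfs://namenode:8020/data/logs/part-00000"),
        0,                   // start offset within the file
        64L * 1024 * 1024,   // split length: one 64MB block
        new String[] {"node12", "node47", "node81"});  // hosts with replicas

    for (String host : split.getLocations()) {
      System.out.println("preferred host: " + host);
    }
  }
}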
I look forward to them giving a talk at any forthcoming London HUG event
and will try to do a follow-on talk introducing MR-279 and arguing in
favour of an OSS solution because the turnaround time on defects is faster.
-Steve
[1] Miron Livny, facing the camera, two to the left of Sergey Melnik
(with the camera) -the author of Dremel: http://flic.kr/p/akUzE7