As an aside, if you ask for the white paper you get a PDF that exaggerates the limits of Hadoop.

http://info.platform.com/rs/platform/images/Whitepaper_Top5ChallengesforHadoopMapReduceintheEnterprise.pdf

Focusing mostly on a critique of the scheduler -which MR-279 will fix in Hadoop 0.23- they say:

  "It is designed to be used by IT departments that
  have an army of developers to help fix any issues they en-
  counter"

I don't believe this. Cloudera and Hortonworks will do this for a fee -as will Platform. In most organisations the R&D effort doesn't go into the Hadoop codebase; it goes into writing the analysis code, which is why things like Pig and Hive help -they make it easier.


  "Their (Clouderas) distribution is based on open source
  which is still an unproven large-scale enterprise full stack
  solution. There are many shortcomings in the open source
  distribution,  including the workload management capa-
  bilities.

  Other open source commercial distributions are
  emerging, with IBM and EMC entering the marketplace.
  However, all of these offerings are based on open source
  code and inevitably inherit the strengths and weaknesses
  of that code base and architectural design. "

Ted will point out that MapR's MR engine isn't limited, as will the Brisk team, while Arun will view that statement in the past tense. Doug and Tom will pick up on the word "unproven" too. Which enterprises plan to have Hadoop clusters bigger than Yahoo or Facebook?

Furthermore, as Platform only swaps in their own scheduler, leaving the filesystem alone, it's a bit weak to critique the architecture of the open source distro. Not a way to make friends -or get your bug fixes in. Nor, indeed, a way to promise better scalability.

"Therefore they cannot meet the enterprise–class requirements for ”big
data” problems as already mentioned."

This is daft. The only thing Platform brings to the table is a scheduler that works with "legacy" grid workloads and a console to see what's going on. I don't see that being tangibly more enterprise-class than the existing JT -which does persist after an outage. With HDFS underneath, a new scheduler doesn't even remove the filesystem SPOFs, so the only way to get an HA cluster is to swap in a premium filesystem.

The other thing the marketing blurb gets wrong is its claim that Hadoop only works with one distributed file system. Not so. You can read in and out of any filesystem; file:// is a handy one that works with NFS mount points too.
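To show what I mean, here's a minimal sketch using the stock FileSystem API -the hostname and mount point are made up for illustration, but FileSystem.get() really does pick the implementation from the URI scheme:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class FsUriDemo {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();

      // file:// resolves to the local filesystem of whichever node runs this,
      // so an NFS mount visible on every node works as a shared data source.
      FileSystem local = FileSystem.get(URI.create("file:///mnt/nfs/data"), conf);
      System.out.println(local.exists(new Path("file:///mnt/nfs/data/input.csv")));

      // hdfs:// goes to the cluster filesystem as usual; same API, different scheme.
      FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
      System.out.println(hdfs.getUri());
    }
  }

The same goes for MapReduce job input and output paths: they're just URIs, so a job can read from an NFS mount via file:// and write to hdfs://, or the other way round.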

Overall, a disappointing white paper: all it can do to criticise open source Hadoop is spread fear about the number of developers you need to maintain it, and the limitations of the Hadoop scheduler versus their scheduler -that being the only thing that differentiates the Platform product from the full OSS release.

I missed a talk at the local university by a Platform sales rep last month, though I did get to offend one of the authors of Condor instead [1], by pointing out that all grid schedulers contain a major assumption: that storage access times are constant across your cluster. That holds if you can pay for something like GPFS, but you don't get 50TB of GPFS storage for $2500, which is what adding 25*2TB SATA drives would cost if you stuck them on your compute nodes; $7500 for a fully replicated 50TB. That's why I'm not a fan of grid systems -the costs of storage and networking aren't taken into account. Then there are the availability issues with the larger filesystems, which are a topic for another day.
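For the record, the sums behind those numbers -assuming roughly $100 per 2TB SATA drive (my back-of-the-envelope figure, not Platform's) and HDFS's default replication factor of 3:

  public class StorageCost {
    public static void main(String[] args) {
      int driveCapacityTB = 2;   // commodity SATA drive
      int drivePriceUSD   = 100; // rough street price assumed above
      int targetTB        = 50;

      int rawDrives = targetTB / driveCapacityTB;                // 25 drives
      System.out.println("raw 50TB:        $" + rawDrives * drivePriceUSD);               // $2500

      int replication = 3;                                       // HDFS default
      System.out.println("replicated 50TB: $" + rawDrives * replication * drivePriceUSD); // $7500
    }
  }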

I look forward to them giving a talk at any forthcoming London HUG event and will try to do a follow-on talk introducing MR-279 and arguing in favour of an OSS solution because the turnaround time on defects is faster.

-Steve

[1] Miron Livny, facing the camera, two to the left of Sergey Melnik (the one with the camera, and the author of Dremel): http://flic.kr/p/akUzE7
