Re: [Beowulf] Hadoop's Uncomfortable Fit in HPC

Prentice Bisbal Tue, 20 May 2014 07:51:22 -0700


On 05/19/2014 08:48 PM, Ellis H. Wilson III wrote:

On 05/19/2014 03:26 PM, Douglas Eadline wrote:
Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
pretty much nails it, and offers a nice overview of the current
ongoing efforts to make it relevant in that field.
http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
Most spot on thing I've read in a while. Thanks Glenn.
I concur. A good assessment of Hadoop. Most HPC users would
benefit from reading Glenn's post. I would offer
a few other thoughts (and a free pdf).
The write-up is interesting, and I know Doug's PDF (and full book for that matter, as I was honored to be asked to help review it) to be worth reading if you want to understand the many subcomponents of the Hadoop project beyond the buzzwords. Very enlightening history in Chapter 1.
However, I do take a few issues with the original write-up (not book) in general:
1. I wish "Hadoop" would die. The term, that is. Hadoop exists less and less by the year. HDFS exists. MapReduce exists. The YARN scheduler exists. As far as I'm concerned, "Hadoop" exists as much as "Big Data" exists. It's too much wrapped into one thing, and only leads to ideological conversations (best left for the suits). It's there for historical reasons, and needs to be cut out of our language ASAP. It leads to more confusion than anything else.
2. You absolutely do not need to use all of the Hadoop sub-projects to use MapReduce, which is the real purpose of using "Hadoop" in HPC at all. There are already perfectly good, high-bandwidth, low-latency, scalable, semantically-rich file systems in place and far more mature than HDFS. So why even bother with HOD (or myhadoop) at all? Just use MapReduce on your existing files. You don't need HDFS, just a Java installation and non-default URIs. Running an MR job via Torque/PBS/et al. is reasonably trivial. FAR more trivial than importing "lakes" of data, as Doug refers to them, from your HPC instance to your "Hadoop" (HDFS) instance anyhow, which is what you have to do with these inane "on-demand" solutions. I will be addressing this in a paper I am presenting at ICDCS this June in Spain, if any are going. Just let me know if interested and I'll share a copy of the paper and times.

Ellis, it sounds like this would be a good thing to write up as a tutorial and share with the list. I'd be interested in getting a copy of that paper when it's available.

3. Java really doesn't matter for MR or for YARN. For HDFS (which, as mentioned, you shouldn't be using in HPC anyhow), yea, I'm not happy with it being in Java, but for MR, if you are experiencing issues relating to Java, you shouldn't be coding in MR. It's not an MR-ready job. You're being lazy. MR should really only be used (in the HPC context) for pre- and post-computation analytics. For all I care, it could be written in BASIC (or Visual BASIC, to bring the point home). Steady-state bandwidth to disk in Java is nearly equivalent to C. Ease of coding and scalability are what make MR great.
Example:
Your 6-week run of your insane climate framework completed on ten-thousand machines and gobbled up a petabyte of intermediate data. All you really need to know is where temperatures in some arctic region are rising faster than some specified value. Spending another two weeks writing a C+MPI/etc program from scratch to do a fancy grep is a total waste of time and capacity. This is where MR shines. Half-day code-up, scale-up very fast, get your results, delete the intermediate data. Science Complete.
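To make the "fancy grep" concrete, here is a minimal sketch of the map and reduce steps in plain Python. It is not taken from the thread and does not use any Hadoop API: the record format ("region,latitude,year,mean_temp_c"), the latitude cutoff, and the function names are all assumptions made for illustration, and the run() helper only imitates the shuffle phase on one machine.

```python
from collections import defaultdict

# Assumed record format (hypothetical): "region,latitude,year,mean_temp_c"
ARCTIC_MIN_LAT = 66.5  # assumed latitude cutoff for "arctic"


def map_record(line):
    """Map step: emit (region, (year, temp)) for arctic records only."""
    region, lat, year, temp = line.strip().split(",")
    if float(lat) >= ARCTIC_MIN_LAT:
        yield region, (int(year), float(temp))


def reduce_region(region, observations, threshold):
    """Reduce step: emit the region if its warming rate exceeds threshold.

    The rate is a crude estimate in degrees C per year, taken between the
    earliest and latest observation for the region.
    """
    obs = sorted(observations)
    (y0, t0), (y1, t1) = obs[0], obs[-1]
    if y1 == y0:
        return  # a single year gives no rate
    rate = (t1 - t0) / (y1 - y0)
    if rate > threshold:
        yield region, rate


def run(lines, threshold):
    """Single-machine stand-in for the shuffle: group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            groups[key].append(value)
    results = {}
    for region, obs in groups.items():
        for key, rate in reduce_region(region, obs, threshold):
            results[key] = rate
    return results


if __name__ == "__main__":
    sample = [
        "svalbard,78.0,2000,-6.0",
        "svalbard,78.0,2010,-4.0",
        "barrow,71.3,2000,-12.0",
        "barrow,71.3,2010,-11.9",
        "miami,25.8,2000,24.0",
        "miami,25.8,2010,30.0",
    ]
    print(run(sample, threshold=0.1))
```

The point of the split is that map_record and reduce_region each see only one record or one key at a time, which is what lets a MapReduce framework spread them across many nodes without any extra coordination code.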
4. Although I'm not the biggest fan of HDFS, this post misses the /entire/ point of HDFS: reliability in the face of (numerous) failures. HDFS (which has a heritage in the Google FS, which the post also fails to mention, despite mentioning MR's heritage out of the same shop) really was designed to be put on crappy hardware and provide really nice throughput. Responsiveness, POSIX semantics, etc., are all really inappropriate remarks. It's like complaining about your dump truck not doing 0-60 faster than a Lamborghini. Not the intention here and thus, I continue to believe it should not be used in HPC environments when most of them demand the Lamborghini for 90% of executions.
Just my (beer-laden) 2c,


A drunk man is a sober man telling the truth.


ellis


_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
