I think looking at technology such as what MapR is using could address the shortcomings of HDFS; there are opportunities to be had with that framework. As for Java, I could pontificate, but to this group I sense this would be pointless... The right tool for the job will trump in the end.
James Lowey

> On May 19, 2014, at 5:48 PM, "Ellis H. Wilson III" <[email protected]> wrote:
>
> On 05/19/2014 03:26 PM, Douglas Eadline wrote:
>>> Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
>>> pretty much nails it, and offers a nice overview of the current
>>> ongoing efforts to make it relevant in that field.
>>>
>>> http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
>>>
>>> Most spot-on thing I've read in a while. Thanks, Glenn.
>>
>> I concur. A good assessment of Hadoop. Most HPC users would
>> benefit from reading Glenn's post. I would offer
>> a few other thoughts (and a free PDF).
>
> The write-up is interesting, and I know Doug's PDF (and full book, for that
> matter, as I was honored to be asked to help review it) to be worth reading
> if you want to understand the many subcomponents of the Hadoop project beyond
> the buzzwords. Very enlightening history in Chapter 1.
>
> However, I do take a few issues with the original write-up (not the book):
>
> 1. I wish "Hadoop" would die. The term, that is. Hadoop exists less and less
> by the year. HDFS exists. MapReduce exists. The YARN scheduler exists. As
> far as I'm concerned, "Hadoop" exists as much as "Big Data" exists. It's too
> much wrapped into one thing, and only leads to ideological conversations
> (best left for the suits). It's there for historical reasons, and needs to
> be cut out of our language ASAP. It leads to more confusion than anything
> else.
>
> 2. You absolutely do not need to use all of the Hadoop sub-projects to use
> MapReduce, which is the real purpose of using "Hadoop" in HPC at all. There
> are already perfectly good, high-bandwidth, low-latency, scalable,
> semantically rich file systems in place that are far more mature than HDFS.
> So why even bother with HOD (or myHadoop) at all? Just use MapReduce on your
> existing files. You don't need HDFS, just a Java installation and
> non-default URIs.
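[As a concrete sketch of that last point: the mount points and jar path below are hypothetical, and `fs.defaultFS` is the Hadoop 2.x property name (older releases use the deprecated `fs.default.name`).]

```shell
# Run a stock MR example directly against an existing POSIX file system
# (Lustre, GPFS, NFS, ...) by overriding the default file-system URI --
# no HDFS instance required. Paths and jar location are illustrative.
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar \
    wordcount \
    -D fs.defaultFS=file:/// \
    file:///lustre/scratch/mydata \
    file:///lustre/scratch/wordcount-out
```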
> Running an MR job via Torque/PBS et al. is reasonably
> trivial. FAR more trivial than importing "lakes" of data, as Doug refers to
> them, from your HPC instance to your "Hadoop" (HDFS) instance anyhow, which
> is what you have to do with these inane "on-demand" solutions. I will be
> addressing this in a paper I present at ICDCS this June in Spain, if any of
> you are going. Just let me know if interested and I'll share a copy of the
> paper and times.
>
> 3. Java really doesn't matter for MR or for YARN. For HDFS (which, as
> mentioned, you shouldn't be using in HPC anyhow), yeah, I'm not happy with
> it being in Java, but for MR, if you are experiencing issues relating to
> Java, you shouldn't be coding in MR. It's not an MR-ready job. You're being
> lazy. MR should really only be used (in the HPC context) for pre- and
> post-computation analytics. For all I care, it could be written in BASIC (or
> Visual BASIC, to bring the point home). Steady-state bandwidth to disk in
> Java is nearly equivalent to C. Ease of coding and scalability are what make
> MR great.
>
> Example:
> Your six-week run of your insane climate framework completed on ten
> thousand machines and gobbled up a petabyte of intermediate data. All you
> really need to know is where temperatures in some arctic region are rising
> faster than some specified value. Spending another two weeks writing a
> C+MPI/etc. program from scratch to do a fancy grep is a total waste of time
> and capacity. This is where MR shines. Half-day code-up, scale up very
> fast, get your results, delete the intermediate data. Science complete.
>
> 4. Although I'm not the biggest fan of HDFS, this post misses the
> /entire/ point of HDFS: reliability in the face of (numerous) failures.
> HDFS (which has its heritage in the Google File System, which the post
> fails to mention despite mentioning MR's heritage out of the same shop)
> really was designed to be put on crappy hardware and provide really nice
> throughput.
> Responsiveness, POSIX semantics, etc., are all really inappropriate
> complaints. It's like complaining about your dump truck not doing 0-60
> faster than a Lamborghini. That's not the intention here, and thus I
> continue to believe it should not be used in HPC environments when most of
> them demand the Lamborghini for 90% of executions.
>
> Just my (beer-laden) 2c,
>
> ellis
>
> --
> Ph.D. Candidate
> Department of Computer Science and Engineering
> The Pennsylvania State University
> www.ellisv3.com
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
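[The "fancy grep" in Ellis's point 3 can be made concrete with a Hadoop-Streaming-style mapper, which also underlines his point that Java doesn't matter: Streaming runs any executable that reads records on stdin and writes tab-separated key/value pairs to stdout. The record format (lat, lon, year, warming trend) and the threshold below are invented purely for illustration.]

```python
#!/usr/bin/env python3
"""Streaming-style mapper: a 'fancy grep' over climate model output.

Assumed (hypothetical) record format, one per line:
    lat<TAB>lon<TAB>year<TAB>trend_c_per_decade
Emits only arctic grid cells whose warming trend exceeds a threshold.
"""

ARCTIC_LAT = 66.5   # Arctic Circle, degrees north
THRESHOLD = 0.5     # degrees C per decade (illustrative value)

def mapper(lines, arctic_lat=ARCTIC_LAT, threshold=THRESHOLD):
    for line in lines:
        try:
            lat, lon, year, trend = line.split("\t")
            lat, trend = float(lat), float(trend)
        except ValueError:
            continue  # skip malformed records rather than fail the task
        if lat >= arctic_lat and trend > threshold:
            # key = grid cell, value = observed trend
            yield f"{lat},{lon}\t{trend}"

if __name__ == "__main__":
    # As a real Streaming mapper this would iterate sys.stdin instead;
    # a tiny in-memory sample keeps the sketch self-contained.
    sample = ["70.0\t-45.0\t2013\t0.9", "10.0\t5.0\t2013\t1.2"]
    for record in mapper(sample):
        print(record)
```

[With Hadoop Streaming this would be launched as something like `hadoop jar hadoop-streaming.jar -input ... -output ... -mapper mapper.py` plus a trivial reducer; the "half-day code-up" is writing roughly this much code, not a C+MPI program.]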