On 05/19/2014 08:48 PM, Ellis H. Wilson III wrote:
On 05/19/2014 03:26 PM, Douglas Eadline wrote:
Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
pretty much nails it, and offers a nice overview of the current
ongoing efforts to make it relevant in that field.
http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
Most spot on thing I've read in a while. Thanks Glenn.
I concur. A good assessment of Hadoop. Most HPC users would
benefit from reading Glenn's post. I would offer
a few other thoughts (and a free pdf).
The write-up is interesting, and I know Doug's PDF (and full book for
that matter, as I was honored to be asked to help review it) to be
worth reading if you want to understand the many subcomponents of the
Hadoop project beyond the buzzwords. Very enlightening history in
Chapter 1.
However, I do take a few issues with the original write-up (not book)
in general:
1. I wish "Hadoop" would die. The term that is. Hadoop exists less
and less by the year. HDFS exists. MapReduce exists. The YARN
scheduler exists. As far as I'm concerned, "Hadoop" exists as much as
"Big Data" exists. It's too much wrapped into one thing, and only
leads to ideological conversations (best left for the suits). It's
there for historical reasons, and needs to be cut out of our language
ASAP. It leads to more confusion than anything else.
2. You absolutely do not need to use all of the Hadoop sub-projects to
use MapReduce, which is the real purpose of using "Hadoop" in HPC at
all. There are already perfectly good, high-bandwidth, low-latency,
scalable, semantically-rich file systems in-place and far more mature
than HDFS. So why even bother with HOD (or myhadoop) at all? Just
use MapReduce on your existing files. You don't need HDFS, just a
java installation and non-default URIs. Running a MR job via
Torque/PBS/et al. is reasonably trivial. FAR more trivial than
importing "lakes" of data as Doug refers to them from your HPC
instance to your "Hadoop" (HDFS) instance anyhow, which is what you
have to do with these inane "on-demand" solutions. I will be
addressing this in a paper I am presenting at ICDCS this June in
Spain, if any are going. Just let me know if you're interested and
I'll share a copy of the paper and the times.
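
For the curious, here is a rough sketch of what "non-default URIs" could
look like in practice. It is only an illustration, not the setup from the
paper: it assumes Hadoop Streaming is used to launch the job, that the
default file system is pointed at the local/parallel file system through
the fs.defaultFS property, and that every path and script name is a
hypothetical placeholder. Whether this alone is enough will depend on
the Hadoop version and the site configuration.

```python
def streaming_command(streaming_jar, input_dir, output_dir, mapper, reducer):
    """Build a Hadoop Streaming command line that uses file:// URIs.

    All arguments are hypothetical placeholders; input_dir and output_dir
    must be absolute paths on the shared (non-HDFS) file system.
    """
    for path in (input_dir, output_dir):
        if not path.startswith("/"):
            raise ValueError("expected an absolute path, got: " + path)
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options (-D) have to come before the streaming options.
        # Assumption: fs.defaultFS=file:/// makes the existing file system
        # the default, so no HDFS instance is involved.
        "-D", "fs.defaultFS=file:///",
        "-input", "file://" + input_dir,
        "-output", "file://" + output_dir,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


if __name__ == "__main__":
    # Print the command so it can be pasted into a Torque/PBS job script.
    print(" ".join(streaming_command(
        "/opt/hadoop/hadoop-streaming.jar",   # hypothetical jar location
        "/scratch/climate/intermediate",      # hypothetical input directory
        "/scratch/climate/mr-output",         # hypothetical output directory
        "map.py", "reduce.py")))
```

The point is simply that the job reads the files where they already
live, so there is no import step into a separate HDFS instance.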
Ellis, it sounds like this would be a good thing to write up as a
tutorial and share with the list. I'd be interested in getting a copy of
that paper when it's available.
3. Java really doesn't matter for MR or for YARN. For HDFS (which, as
mentioned, you shouldn't be using in HPC anyhow), yea, I'm not happy
with it being in Java, but for MR, if you are experiencing issues
relating to Java, you shouldn't be coding in MR. It's not an MR-ready
job. You're being lazy. MR should really only be used (in the HPC
context) for pre- and post-computation analytics. For all I care, it
could be written in BASIC (or Visual BASIC, to bring the point home).
Steady-state bandwidth to disk in Java is nearly equivalent to C.
Ease of coding and scalability are what make MR great.
Example:
Your 6-week run of your insane climate framework completed on
ten-thousand machines and gobbled up a petabyte of intermediate data.
All you really need to know is where temperatures in some arctic
region are rising faster than some specified value. Spending another
two weeks writing a C+MPI/etc program from scratch to do a fancy grep
is a total waste of time and capacity. This is where MR shines.
Half-day code-up, scale-up very fast, get your results, delete the
intermediate data. Science Complete.
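
To make the "fancy grep" concrete, here is a minimal sketch of the map
and reduce steps in plain Python. It does not use any Hadoop API; it
only shows the shape of the job. The record format
("region,year,temperature"), the function names, and the use of a
least-squares slope as the "rate of rise" are all assumptions made for
illustration.

```python
from collections import defaultdict


def map_record(line):
    """Map step: turn one 'region,year,temperature' line into (key, value)."""
    region, year, temp = line.strip().split(",")
    return region, (float(year), float(temp))


def warming_rate(points):
    """Least-squares slope of temperature against year (degrees per year)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        return 0.0  # a single year of data gives no trend
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return cov_xy / var_x


def run_job(lines, threshold):
    """Group mapped records by region, then keep regions above threshold."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        region, point = map_record(line)
        groups[region].append(point)  # stands in for the shuffle phase
    # Reduce step: one slope per region, filtered by the threshold.
    result = {}
    for region, points in groups.items():
        rate = warming_rate(points)
        if rate > threshold:
            result[region] = rate
    return result
```

In a real MR job the framework does the grouping and spreads the map and
reduce calls across the machines; the per-record and per-region logic
stays about this small.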
4. Although I'm not the biggest fan of HDFS, this post misses the
/entire/ point of HDFS: reliability in the face of (numerous)
failures. HDFS (which has a heritage in the Google FS, which the post
also fails to mention, despite mentioning MR's heritage out of the
same shop) really was designed to be put on crappy hardware and
provide really nice throughput. Responsiveness, POSIX semantics, etc,
are all really inappropriate remarks. It's like complaining about
your dump truck not doing 0-60 faster than a Lamborghini. Not the
intention here and thus, I continue to believe it should not be used
in HPC environments when most of them demand the Lamborghini for 90%
of executions.
Just my (beer-laden) 2c,
A drunk man is a sober man telling the truth.
ellis
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf