I think looking at technology such as what MapR is using could address the shortcomings of HDFS; there are opportunities to be had with that framework. As for Java, I could pontificate, but to this group I sense this would be pointless... The right tool for the job will trump in the end.
James Lowey

> On May 19, 2014, at 5:48 PM, "Ellis H. Wilson III" <[email protected]> wrote:
>
> On 05/19/2014 03:26 PM, Douglas Eadline wrote:
>>> Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
>>> pretty much nails it, and offers a nice overview of the current
>>> ongoing efforts to make it relevant in that field.
>>>
>>> http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
>>>
>>> Most spot-on thing I've read in a while. Thanks, Glenn.
>>
>> I concur. A good assessment of Hadoop. Most HPC users would
>> benefit from reading Glenn's post. I would offer
>> a few other thoughts (and a free PDF).
>
> The write-up is interesting, and I know Doug's PDF (and full book, for that
> matter, as I was honored to be asked to help review it) to be worth reading
> if you want to understand the many subcomponents of the Hadoop project beyond
> the buzzwords. Very enlightening history in Chapter 1.
>
> However, I do take a few issues with the original write-up (not the book):
>
> 1. I wish "Hadoop" would die. The term, that is. Hadoop exists less and less
> by the year. HDFS exists. MapReduce exists. The YARN scheduler exists. As
> far as I'm concerned, "Hadoop" exists as much as "Big Data" exists. It's too
> much wrapped into one thing, and only leads to ideological conversations
> (best left for the suits). It's there for historical reasons, and needs to
> be cut out of our language ASAP. It leads to more confusion than anything
> else.
>
> 2. You absolutely do not need to use all of the Hadoop sub-projects to use
> MapReduce, which is the real purpose of using "Hadoop" in HPC at all. There
> are already perfectly good, high-bandwidth, low-latency, scalable,
> semantically rich file systems in place that are far more mature than HDFS.
> So why even bother with HOD (or myHadoop) at all? Just use MapReduce on your
> existing files. You don't need HDFS, just a Java installation and
> non-default URIs.
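[As a concrete sketch of that last point: the mount points and jar path below are hypothetical, and `fs.defaultFS` is the Hadoop 2.x property name (older releases use the deprecated `fs.default.name`).]

```shell
# Run a stock MR example directly against an existing POSIX file system
# (Lustre, GPFS, NFS, ...) by overriding the default file-system URI --
# no HDFS instance required. Paths and jar location are illustrative.
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar \
    wordcount \
    -D fs.defaultFS=file:/// \
    file:///lustre/scratch/mydata \
    file:///lustre/scratch/wordcount-out
```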
> Running an MR job via Torque/PBS et al. is reasonably
> trivial. FAR more trivial than importing "lakes" of data, as Doug refers to
> them, from your HPC instance to your "Hadoop" (HDFS) instance anyhow, which
> is what you have to do with these inane "on-demand" solutions. I will be
> addressing this in a paper I present at ICDCS this June in Spain, if any of
> you are going. Just let me know if interested and I'll share a copy of the
> paper and times.
>
> 3. Java really doesn't matter for MR or for YARN. For HDFS (which, as
> mentioned, you shouldn't be using in HPC anyhow), yeah, I'm not happy with
> it being in Java, but for MR, if you are experiencing issues relating to
> Java, you shouldn't be coding in MR. It's not an MR-ready job. You're being
> lazy. MR should really only be used (in the HPC context) for pre- and
> post-computation analytics. For all I care, it could be written in BASIC (or
> Visual BASIC, to bring the point home). Steady-state bandwidth to disk in
> Java is nearly equivalent to C. Ease of coding and scalability are what make
> MR great.
>
> Example:
> Your six-week run of your insane climate framework completed on ten
> thousand machines and gobbled up a petabyte of intermediate data. All you
> really need to know is where temperatures in some arctic region are rising
> faster than some specified value. Spending another two weeks writing a
> C+MPI/etc. program from scratch to do a fancy grep is a total waste of time
> and capacity. This is where MR shines. Half-day code-up, scale up very
> fast, get your results, delete the intermediate data. Science complete.
>
> 4. Although I'm not the biggest fan of HDFS, this post misses the
> /entire/ point of HDFS: reliability in the face of (numerous) failures.
> HDFS (which has its heritage in the Google File System, which the post
> fails to mention despite mentioning MR's heritage out of the same shop)
> really was designed to be put on crappy hardware and provide really nice
> throughput.
> Responsiveness, POSIX semantics, etc., are all really inappropriate
> complaints. It's like complaining about your dump truck not doing 0-60
> faster than a Lamborghini. That's not the intention here, and thus I
> continue to believe it should not be used in HPC environments when most of
> them demand the Lamborghini for 90% of executions.
>
> Just my (beer-laden) 2c,
>
> ellis
>
> --
> Ph.D. Candidate
> Department of Computer Science and Engineering
> The Pennsylvania State University
> www.ellisv3.com
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
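[The "fancy grep" in Ellis's point 3 can be made concrete with a Hadoop-Streaming-style mapper, which also underlines his point that Java doesn't matter: Streaming runs any executable that reads records on stdin and writes tab-separated key/value pairs to stdout. The record format (lat, lon, year, warming trend) and the threshold below are invented purely for illustration.]

```python
#!/usr/bin/env python3
"""Streaming-style mapper: a 'fancy grep' over climate model output.

Assumed (hypothetical) record format, one per line:
    lat<TAB>lon<TAB>year<TAB>trend_c_per_decade
Emits only arctic grid cells whose warming trend exceeds a threshold.
"""

ARCTIC_LAT = 66.5   # Arctic Circle, degrees north
THRESHOLD = 0.5     # degrees C per decade (illustrative value)

def mapper(lines, arctic_lat=ARCTIC_LAT, threshold=THRESHOLD):
    for line in lines:
        try:
            lat, lon, year, trend = line.split("\t")
            lat, trend = float(lat), float(trend)
        except ValueError:
            continue  # skip malformed records rather than fail the task
        if lat >= arctic_lat and trend > threshold:
            # key = grid cell, value = observed trend
            yield f"{lat},{lon}\t{trend}"

if __name__ == "__main__":
    # As a real Streaming mapper this would iterate sys.stdin instead;
    # a tiny in-memory sample keeps the sketch self-contained.
    sample = ["70.0\t-45.0\t2013\t0.9", "10.0\t5.0\t2013\t1.2"]
    for record in mapper(sample):
        print(record)
```

[With Hadoop Streaming this would be launched as something like `hadoop jar hadoop-streaming.jar -input ... -output ... -mapper mapper.py` plus a trivial reducer; the "half-day code-up" is writing roughly this much code, not a C+MPI program.]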