> Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
> pretty much nails it, and offers a nice overview of the current
> ongoing efforts to make it relevant in that field.
>
> http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
>
> Most spot-on thing I've read in a while. Thanks Glenn.
>
> Cheers,
> --
> Kilian
I concur. A good assessment of Hadoop. Most HPC users would benefit from reading Glenn's post. I would offer a few other thoughts (and a free PDF).

First, I have recently co-authored a book on Apache Hadoop YARN (i.e., Hadoop V2). As a hardcore HPC dude, it was an interesting exercise. The most important thing to remember about Hadoop is that it is changing and evolving, and is not necessarily synonymous with MapReduce. Hadoop V2 (with YARN) is more of a general-purpose "cluster OS" on which application frameworks can be built. Yes, it is written in Java; yes, HDFS seems downright weird; and yes, it has some different ways of doing things; but there are valid reasons for the design.

To help understand Hadoop, I recommend reading the first chapter of the book (available for free), as it provides the history and rationale for Hadoop's development (I think this is the first time it has been carefully written down). You can get a free PDF copy of this chapter from:

http://ptgmedia.pearsoncmg.com/images/9780321934505/samplepages/0321934504.pdf

The goal of Hadoop is analysis of all types and forms of large, unrelated data sets. The term "Hadoop data lake" is becoming a new buzzword. As you might imagine, a massive "data lake" is where all organizational data is placed (dumped, copied, archived) for processing. Some of the processing might be batch-oriented MapReduce, or real-time MapReduce with Apache Tez, or graph processing using Apache Giraph, or in-memory processing using Apache Spark, or even MPI (though not optimal), or any other framework you care to create (in any language you desire, maybe with a little Java glue). And the frameworks can use Hadoop services such as data locality or dynamic run-time resource allocation/de-allocation (a sketch of what that looks like to a framework author follows my signature).

Personally, I see parts of the Hadoop ecosystem creeping into HPC. Intel's work with Lustre and Hadoop is an example of Hadoop tools accessing the "scientific data lake." I think this will happen as needed and where it makes sense, based on the data size and available tools. I see a stronger migration of HPC methods into the "non-HPC data lakes." Things like Hadoop V2 now make this possible.

Notice that, other than this sentence, I did not use the term "Big Data" in this email.

--
Doug
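P.S. For the curious, here is a minimal sketch (mine, not from the book) of what dynamic resource allocation and a data-locality hint look like to a YARN framework author, using the AMRMClient API from the Hadoop 2 Java client libraries. The node name "node123.cluster" and the container sizes are made up, and error handling is omitted; treat it as an illustration rather than working framework code:

  import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
  import org.apache.hadoop.yarn.api.records.Priority;
  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.client.api.AMRMClient;
  import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
  import org.apache.hadoop.yarn.conf.YarnConfiguration;

  public class AllocationSketch {
      public static void main(String[] args) throws Exception {
          // An ApplicationMaster talks to the YARN ResourceManager
          // through AMRMClient.
          AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
          rm.init(new YarnConfiguration());
          rm.start();
          rm.registerApplicationMaster("", 0, "");

          // Ask for one container (1 GB, 1 vcore) and give a locality
          // hint: prefer the (hypothetical) node that holds the HDFS
          // block we want to process.
          Resource size = Resource.newInstance(1024, 1);
          ContainerRequest ask = new ContainerRequest(
                  size,
                  new String[] { "node123.cluster" }, // preferred nodes
                  null,                               // no rack preference
                  Priority.newInstance(0));
          rm.addContainerRequest(ask);

          // A real framework would now loop on rm.allocate(progress),
          // launch work in the granted containers, and release them as
          // it finishes -- the de-allocation half of the story.
          rm.unregisterApplicationMaster(
                  FinalApplicationStatus.SUCCEEDED, "", "");
      }
  }

Note that this only makes sense inside a container YARN has already launched as an ApplicationMaster; the point is simply that resources are requested, granted, and returned at run time, per node if you ask, which is exactly the service the frameworks above are built on.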
