Hi Sriram,

>> The I-file concept could possibly be implemented here in a fairly self
>> contained way. One
>> could even colocate/embed a KFS filesystem with such an alternate
>> shuffle, like how MR task temporary space is usually colocated with
>> HDFS storage.

> Exactly.

>> Does this seem reasonable in any way?

> Great. Where do we go from here? How do we get a collaborative effort going?

Sounds like a JIRA issue should be opened, the approach briefly described, and
the first implementation attempt made. Then iterate.
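For concreteness, the plug-in point such a JIRA might propose could look roughly
like the sketch below. This is only an illustration for discussion -- the interface
name and its methods are made up here, not an existing Hadoop API -- but it shows
how an I-file/KFS-backed shuffle could slot in behind a reduce task without
touching the rest of the framework:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.RawKeyValueIterator;
    import org.apache.hadoop.mapreduce.TaskAttemptID;

    // Hypothetical plug-in point for an alternate shuffle; all names are illustrative.
    // A reduce task would ask the plugin for its merged input instead of running the
    // built-in HTTP fetch/merge, so a Sailfish-style implementation could simply read
    // sorted records out of I-files stored in a colocated KFS instance.
    public interface ReduceShufflePlugin {

      /** Called once per reduce task with the job configuration and task identity. */
      void init(Configuration conf, TaskAttemptID reduceId);

      /**
       * Fetch (or locate) all map output destined for this reduce task and return
       * an iterator over the merged, sorted records.
       */
      RawKeyValueIterator fetchAndMerge() throws IOException, InterruptedException;

      /** Release resources (connections to the KFS metaserver, local buffers, etc.). */
      void close() throws IOException;
    }

The default implementation would keep today's HTTP-based fetch-and-merge behaviour,
while a Sailfish implementation would mostly be a thin client that opens the right
I-files.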
I look forward to seeing this! :)

Otis
--
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

>________________________________
> From: Sriram Rao <srirams...@gmail.com>
>To: common-dev@hadoop.apache.org
>Sent: Tuesday, May 8, 2012 6:48 PM
>Subject: Re: Sailfish
>
>Dear Andy,
>
>> From: Andrew Purtell <apurt...@apache.org>
>> ...
>
>> Do you intend this to be a joint project with the Hadoop community or
>> a technology competitor?
>
>As I had said in my email, we are looking for folks to collaborate
>with us to help get us integrated with Hadoop. So, to be explicitly
>clear, we are intending for this to be a joint project with the
>community.
>
>> Regrettably, KFS is not a "drop in replacement" for HDFS.
>> Hypothetically: I have several petabytes of data in an existing HDFS
>> deployment, which is the norm, and a continuous MapReduce workflow.
>> How do you propose I, practically, migrate to something like Sailfish
>> without a major capital expenditure and/or downtime and/or data loss?
>
>Well, we are not asking for KFS to replace HDFS. One path you could
>take is to experiment with Sailfish---use KFS just for the
>intermediate data and HDFS for everything else. There is no major
>capex :). While you get comfy with pushing intermediate data into a
>DFS, we get the ideas added to HDFS. This simplifies deployment
>considerations.
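To experiment with that split without touching any existing HDFS data, KFS can be
registered alongside HDFS and addressed by its own URI scheme. Below is a rough
sketch using the KFS bindings bundled with Hadoop 0.20.x and placeholder host
names; the exact property names are worth double-checking against the Hadoop and
KFS versions in use:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class KfsAlongsideHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // HDFS stays the default filesystem; job input and output are untouched.
        conf.set("fs.default.name", "hdfs://namenode.example.com:8020");

        // Register the KFS client so kfs:// paths resolve, and point it at the metaserver.
        conf.set("fs.kfs.impl", "org.apache.hadoop.fs.kfs.KosmosFileSystem");
        conf.set("fs.kfs.metaServerHost", "kfs-meta.example.com");
        conf.set("fs.kfs.metaServerPort", "20000");

        // Intermediate (I-file) data can then be addressed explicitly on KFS.
        Path intermediate = new Path("kfs://kfs-meta.example.com:20000/tmp/job-0001/ifiles");
        FileSystem ifs = intermediate.getFileSystem(conf);
        System.out.println("Intermediate store: " + ifs.getUri());
      }
    }

Dropping the fs.kfs.* properties puts a job back on stock HDFS-only behaviour, so
the experiment stays reversible.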
>
>> However, can the Sailfish I-files implementation be plugged in as an
>> alternate Shuffle implementation in MRv2 (see MAPREDUCE-3060 and
>> MAPREDUCE-4049),
>
>This'd be great!
>
>> with necessary additional plumbing for dynamic
>> adjustment of the reduce task population? And the workbuilder could be
>> part of an alternate MapReduce ApplicationMaster?
>
>It should be part of the AM. (Currently, with our implementation in
>Hadoop-0.20.2, the workbuilder serves the role of an AM).
>
>> The I-file concept could possibly be implemented here in a fairly self
>> contained way. One
>> could even colocate/embed a KFS filesystem with such an alternate
>> shuffle, like how MR task temporary space is usually colocated with
>> HDFS storage.
>
>Exactly.
>
>> Does this seem reasonable in any way?
>
>Great. Where do we go from here? How do we get a collaborative effort going?
>
>Best,
>
>Sriram
>
>>> From: Sriram Rao <srirams...@gmail.com>
>>> To: common-dev@hadoop.apache.org
>>> Sent: Tuesday, May 8, 2012 10:32 AM
>>> Subject: Project announcement: Sailfish (also, looking for collaborators)
>>>
>>> Hi,
>>>
>>> I'd like to announce the release of a new open source project, Sailfish.
>>>
>>> http://code.google.com/p/sailfish/
>>>
>>> Sailfish tries to improve Hadoop performance, particularly for large jobs
>>> that process TBs of data and run for hours. In building Sailfish, we
>>> modify how map output is handled and transported from map to reduce.
>>>
>>> The project pages provide more information about the project.
>>>
>>> We are looking for collaborators who can help get some of the ideas into
>>> Apache Hadoop. A possible step forward could be to make the "shuffle" phase
>>> of Hadoop pluggable.
>>>
>>> If you are interested in working with us, please get in touch with me.
>>>
>>> Sriram
>>
>
>--
>Best regards,
>
>  - Andy
>
>Problems worthy of attack prove their worth by hitting back.  - Piet
>Hein (via Tom White)