Hi Arkady,

With my changes, which should be available soon, the user can specify the following:
1. Mapper (a Java mapper class or an executable)
2. Reducer (a Java reducer class or an executable). Reduce NONE will be
   introduced as per HADOOP-1216.
3. InputFormat class
4. OutputFormat class
5. Partitioner

I don't understand what you mean by (input partitioner, splitter for
reduce, sorter for reduce). Can you explain?

Hadoop has a collection of built-in classes:

IdentityMapper, IdentityReducer, RegexMapper, TokenCountMapper, LongSumReducer
TextInputFormat, SequenceFileInputFormat, TextOutputFormat,
SequenceFileOutputFormat, NullOutputFormat

Some more are coming soon: SequenceFileToLineInputFormat,
KeyValueTextInputFormat.

We can add IdentityMapper/IdentityReducer/KeyValueTextInputFormat/TextOutputFormat
as the defaults for Hadoop Streaming.

Runping

> -----Original Message-----
> From: Arkady Borkovsky [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 10, 2007 1:24 PM
> To: [email protected]
> Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to
> specify a partitioner
>
> To extend this,
> I'd suggest that Hadoop Streaming is interfaced in the following way:
>
> Map reduce process is parameterized by several algorithms.
> This includes at least
> 1. mapper
> 2. reducer (including special case of NONE)
> 3. input format
> 4. input partitioner
> 5. splitter for reduce
> 6. sorter for reduce
>
> The current Hadoop Streaming allows to specify only the 1 and 2 (and
> gives a limited control on 3)
> Nicely, the 1 (mapper) can be specified both as a command to stream the
> data through, or a Java class to use.
>
> It would make a lot of sense to
> (a) allow to specify a Java class that implements each of these
> (b) provide meaningful defaults, so that the user of Hadoop Streaming
> does not need to worry about details irrelevant for her specific task.
> (c) provide a set of useful classes so that the user can pick the
> necessary ones rather than re-implementing same things again and again.
> (c.1) make sure that there is a convenient short-hand to specify these
> predefined classes (e.g. without long package prefix)
>
> In particular, it would be good to have predefined Identity mapper and
> reducer (the mapper actually is available now), reducers that provide
> simple aggregation (like in Abacus), input formats for commonly used
> formats (including CSV, flat XML, etc), sorter different from splitter,
> etc.
>
> Then "Streaming should allow to specify a partitioner" would be
> automatically resolved as a special case.
> It might be better to implement the whole consistent approach rather
> than do special cases one by one.
>
> -- ab
>
>
> On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:
>
> > Streaming should allow to specify a partitioner
> > -----------------------------------------------
> >
> > Key: HADOOP-1215
> > URL: https://issues.apache.org/jira/browse/HADOOP-1215
> > Project: Hadoop
> > Issue Type: Improvement
> > Reporter: Runping Qi
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
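To make the partitioner discussion in this thread concrete, here is a rough sketch of what a partitioner decides: which reduce task receives a given map output key. This is only an illustration in Python, not Hadoop code. The function names hash_partition and prefix_partition, the "." separator, and the use of crc32 as the hash are all assumptions made up for this example.

```python
import zlib


def hash_partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index from the whole key.

    Stand-in for a default hash-style partitioner. crc32 is used only
    because it is stable across runs; it is NOT the hash Hadoop uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def prefix_partition(key: str, num_reducers: int, sep: str = ".") -> int:
    """Pick a reducer index from the part of the key before `sep`.

    Hypothetical custom partitioner: keys that share a prefix always
    go to the same reducer, even if the rest of the key differs.
    """
    prefix = key.split(sep, 1)[0]
    return hash_partition(prefix, num_reducers)


if __name__ == "__main__":
    # Hypothetical composite keys of the form "<user>.<event>".
    for k in ["user1.click", "user1.view", "user2.click"]:
        print(k, hash_partition(k, 4), prefix_partition(k, 4))
```

With a whole-key hash, "user1.click" and "user1.view" may land on different reducers; with the prefix-based variant they always land on the same one. That kind of control is what being able to specify a partitioner in Streaming would give the user.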
