Responses are inline below.
> -----Original Message-----
> From: Arkady Borkovsky [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 11, 2007 5:09 PM
> To: [email protected]
> Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to
> specify a partitioner
>
> This looks very good.
>
> Probably my "splitter for reduce" is what you call "partitioner" -- a
> function that, given a "key" K and the number of "partitions" N,
> produces an index I in (0..N-1).
>
[Runping Qi] Yes, that is exactly what the partitioner does.

> "sorter for reduce" -- the function that defines how records with the
> same key are ordered when presented to reduce.
>
[Runping Qi] This can be achieved through the partitioner class for
streaming. Suppose your inputs are a list of records, each with multiple
fields. Logically you want to group them by fields 1, 2 and 3, but you
also want the records sorted by fields 4 and 5 within each group. What
you can do is have your mapper compose keys using fields 1, 2, 3, 4 and
5, and have your partitioner partition by fields 1, 2 and 3 only.

> It may be useful to be able to specify how the Map input is
> partitioned. It is part of InputFormat. However, from a naive user
> perspective, specifying how you read records and find record boundaries
> is very different from specifying how to partition the input. (I
> agree this is not a high priority issue -- as long as I can specify the
> number of map tasks I'd like to have.)
>
> Should the list of specifiable parameters also include the Combiner
> class?
>
[Runping Qi] A good point. Streaming already allows that.

> And once again, it would be great if Abacus classes were available in
> the reworked Streaming through exactly the same mechanism, without
> additional conventions.
> E.g. I'd like to have tab separated <key, value> as input,
> IdentityMapper, and the Abacus class that gives me the sum, the count,
> and std of values for each key.
> (It is https://issues.apache.org/jira/browse/HADOOP-1247)
>
[Runping Qi] That will be the work of
https://issues.apache.org/jira/browse/HADOOP-1247. It is coming soon.

> -- ab
>
> On Apr 10, 2007, at 1:58 PM, Runping Qi wrote:
>
> > Hi Arkady,
> >
> > With my changes, which should be available soon, the user can specify
> > the following:
> >
> > 1. Mapper (a Java mapper class or an executable)
> > 2. Reducer (a Java reducer class or an executable). Reduce NONE will be
> >    introduced as per HADOOP-1216.
> > 3. InputFormat class
> > 4. OutputFormat class
> > 5. Partitioner
> >
> > I don't understand what you mean by (input partitioner, splitter for
> > reduce, sorter for reduce). Can you explain?
> >
> > Hadoop has a collection of built-in classes:
> >
> > IdentityMapper, IdentityReducer, RegexMapper, TokenCountMapper,
> > LongSumReducer
> >
> > TextInputFormat, SequenceFileInputFormat, TextOutputFormat,
> > SequenceFileOutputFormat, NullOutputFormat
> >
> > Some more coming soon:
> >
> > SequenceFileToLineInputFormat, KeyValueTextInputFormat.
> >
> > We can add IdentityMapper/IdentityReducer/
> > KeyValueTextInputFormat/TextOutputFormat as the defaults for Hadoop
> > Streaming.
> >
> > Runping
> >
> >> -----Original Message-----
> >> From: Arkady Borkovsky [mailto:[EMAIL PROTECTED]
> >> Sent: Tuesday, April 10, 2007 1:24 PM
> >> To: [email protected]
> >> Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to
> >> specify a partitioner
> >>
> >> To extend this,
> >> I'd suggest that Hadoop Streaming is interfaced in the following way:
> >>
> >> The map-reduce process is parameterized by several algorithms.
> >> This includes at least
> >> 1. mapper
> >> 2. reducer (including the special case of NONE)
> >> 3. input format
> >> 4. input partitioner
> >> 5. splitter for reduce
> >> 6. sorter for reduce
> >>
> >> The current Hadoop Streaming allows specifying only 1 and 2 (and
> >> gives limited control over 3).
> >> Nicely, 1 (the mapper) can be specified either as a command to stream
> >> the data through, or as a Java class to use.
> >>
> >> It would make a lot of sense to
> >> (a) allow specifying a Java class that implements each of these
> >> (b) provide meaningful defaults, so that the user of Hadoop Streaming
> >> does not need to worry about details irrelevant for her specific task.
> >> (c) provide a set of useful classes so that the user can pick the
> >> necessary ones rather than re-implementing the same things again and
> >> again.
> >> (c.1) make sure that there is a convenient short-hand to specify these
> >> predefined classes (e.g. without a long package prefix)
> >>
> >> In particular, it would be good to have a predefined Identity mapper
> >> and reducer (the mapper actually is available now), reducers that
> >> provide simple aggregation (like in Abacus), input formats for
> >> commonly used formats (including CSV, flat XML, etc), a sorter
> >> different from the splitter, etc.
> >>
> >> Then "Streaming should allow to specify a partitioner" would be
> >> automatically resolved as a special case.
> >> It might be better to implement the whole consistent approach rather
> >> than do special cases one by one.
> >>
> >> -- ab
> >>
> >> On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:
> >>
> >>> Streaming should allow to specify a partitioner
> >>> -----------------------------------------------
> >>>
> >>>                  Key: HADOOP-1215
> >>>                  URL: https://issues.apache.org/jira/browse/HADOOP-1215
> >>>              Project: Hadoop
> >>>           Issue Type: Improvement
> >>>             Reporter: Runping Qi
> >>>
> >>> --
> >>> This message is automatically generated by JIRA.
> >>> -
> >>> You can reply to this email to add a comment to the issue online.
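
For readers who want to see the partitioner / "sorter for reduce" idea from this thread in concrete form, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop API: the function names (partition, shuffle, groups), the tuple-of-strings record layout, and the CRC32 hash are all assumptions made for illustration. It only simulates the scheme described above: the key is composed of fields 1-5, the partition index in 0..N-1 is computed from fields 1-3 only, and each partition is sorted by the full key, so records in the same group arrive ordered by fields 4 and 5.

```python
import zlib
from itertools import groupby


def partition(key_fields, num_partitions, num_partition_fields=3):
    """Map a composite key to an index in 0..num_partitions-1.

    Only the leading fields are hashed (an illustrative choice), so keys
    that agree on those fields always land in the same partition.
    """
    prefix = "\t".join(key_fields[:num_partition_fields]).encode("utf-8")
    return zlib.crc32(prefix) % num_partitions


def shuffle(records, num_partitions):
    """Assign each record to a partition, then sort each partition by full key."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[partition(record, num_partitions)].append(record)
    for part in partitions:
        part.sort()  # lexicographic order over all five fields
    return partitions


def groups(part, num_group_fields=3):
    """Split one sorted partition into (group key, records) pairs."""
    return [
        (key, list(members))
        for key, members in groupby(part, key=lambda r: tuple(r[:num_group_fields]))
    ]


if __name__ == "__main__":
    # Hypothetical records: five string fields each.
    demo = [
        ("a", "b", "c", "2", "x"),
        ("a", "b", "c", "1", "y"),
        ("d", "e", "f", "9", "z"),
        ("a", "b", "c", "1", "a"),
    ]
    for index, part in enumerate(shuffle(demo, 4)):
        for key, members in groups(part):
            print(index, key, members)
```

Because the sort key starts with the same fields the partitioner hashes, all records of one group are contiguous within their partition, which is what lets a reducer process them group by group. Fields are compared as strings here, so numeric fields would need zero-padding or a numeric comparison to sort as numbers.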
