RE: [jira] Created: (HADOOP-1215) Streaming should allow to specify a partitioner

Runping Qi Wed, 11 Apr 2007 17:25:23 -0700

Responses are inline below.


> -----Original Message-----
> From: Arkady Borkovsky [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 11, 2007 5:09 PM
> To: [email protected]
> Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to
> specify a partitioner
> 
> This looks very good.
> 
> Probably my "splitter for reduce" is what you call "partitioner"  -- a
> function that given a "key" K and the number of "partitions" N produces
> the I in (0..N-1).
> 
[Runping Qi] 
Yes, that is exactly what the partitioner does.
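
As a minimal sketch of that contract (not Hadoop's actual implementation), the function below maps a key K and a partition count N to an index I in 0..N-1. The name `partition` and the choice of CRC32 as the hash are assumptions made for illustration only.

```python
import zlib


def partition(key: str, num_partitions: int) -> int:
    """Return a partition index in the range 0..num_partitions-1 for a key."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 is an illustrative choice: it is deterministic across runs and
    # processes, so equal keys always land in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

The only properties that matter are that the result is always within range and that equal keys always produce the same index.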

> "sorter for reduce" -- the function that defines how records with the
> same key are ordered when presented to reduce.
> 
[Runping Qi] 
This can be achieved through a partitioner class for streaming.
Suppose your inputs are a list of records, each with multiple fields.
Logically you want to group them by fields 1,2 and 3, but you also want the
records sorted by fields 4 and 5 within each group. What you can do is to
have your mapper compose keys using fields 1,2,3,4 and 5, and have your
partitioner partition by fields 1,2, and 3 only.
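
The idea can be simulated outside Hadoop with a short sketch. Everything here is hypothetical and for illustration: the tab separator inside the composite key, the helper names (`make_key`, `partition`, `shuffle`), and the CRC32 hash. It only assumes that the framework sorts the full keys within each partition.

```python
import zlib
from collections import defaultdict

SEP = "\t"  # assumed separator between fields inside the composite key


def make_key(record):
    """Mapper side: compose the key from fields 1 through 5."""
    return SEP.join(record[:5])


def partition(key, num_partitions):
    """Partitioner side: hash fields 1 through 3 only."""
    group = SEP.join(key.split(SEP)[:3])
    return zlib.crc32(group.encode("utf-8")) % num_partitions


def shuffle(records, num_partitions):
    """Simulate the shuffle: bucket by partition, then sort by the full key."""
    buckets = defaultdict(list)
    for record in records:
        key = make_key(record)
        buckets[partition(key, num_partitions)].append(key)
    # Sorting field by field orders each group by fields 4 and 5.
    return {p: sorted(keys, key=lambda k: k.split(SEP))
            for p, keys in buckets.items()}
```

All records that share fields 1, 2 and 3 reach the same reducer, already ordered by fields 4 and 5. The reducer still has to detect group boundaries itself by watching for a change in the first three fields.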


> It may be useful to be able to specify how the Map input is
> partitioned.  It is part of InputFormat.  However, from a naive user
> perspective specifying how you read records and find record boundaries
> is very different from specifying how to partition the input.   (I
> agree this is not high priority issue -- as long as I can specify the
> number of map tasks I'd like to have).
> 
> Should the list of specifiable parameters also include Combiner class?
> 
[Runping Qi] 
A good point. Streaming already allows that.


> And once again, it would be great if Abacus classes were available in
> the reworked Streaming through exactly the same mechanism without
> additional conventions.
> E.g. I'd like to have tab separated <key, value> as input,
> IdentityMapper, and the Abacus class that gives me the sum, the count,
> and std of values for each key.
>   (It is https://issues.apache.org/jira/browse/HADOOP-1247)
> 
[Runping Qi] 
That will be the work of https://issues.apache.org/jira/browse/HADOOP-1247
It is coming soon.



> -- ab
> 
> On Apr 10, 2007, at 1:58 PM, Runping Qi wrote:
> 
> >
> > Hi Arkady,
> >
> > With my changes that should be available soon, the user can specify
> > the following:
> >
> > 1. Mapper (a Java mapper class or an executable)
> > 2. Reducer (a Java reducer class or an executable). Reduce NONE will be
> > introduced as per HADOOP-1216.
> > 3. InputFormat class
> > 4. OutputFormat class
> > 5. Partitioner
> >
> > I don't understand what you mean by (input partitioner, splitter for
> > reduce, sorter for reduce). Can you explain?
> >
> > Hadoop has a collection of built-in classes:
> >
> > IdentityMapper, IdentityReducer, RegexMapper, TokenCountMapper,
> > LongSumReducer
> >
> > TextInputFormat, SequenceFileInputFormat, TextOutputFormat,
> > SequenceFileOutputFormat, NullOutputFormat
> >
> > Some more coming soon:
> >
> > SequenceFileToLineInputFormat, KeyValueTextInputFormat.
> >
> > We can add IdentityMapper/IdentityReducer/
> > KeyValueTextInputFormat/TextOutputFormat as the defaults for Hadoop
> > Streaming.
> >
> >
> > Runping
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: Arkady Borkovsky [mailto:[EMAIL PROTECTED]
> >> Sent: Tuesday, April 10, 2007 1:24 PM
> >> To: [email protected]
> >> Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to
> >> specify a partitioner
> >>
> >> To extend this,
> >> I'd suggest that Hadoop Streaming is interfaced in the following way:
> >>
> >> Map reduce process is parameterized by several algorithms.
> >> This includes at least
> >> 1. mapper
> >> 2. reducer  (including special case of NONE)
> >> 3. input format
> >> 4. input partitioner
> >> 5. splitter for reduce
> >> 6. sorter for reduce
> >>
> >> The current Hadoop Streaming allows specifying only 1 and 2 (and
> >> gives limited control over 3).
> >> Nicely, 1 (the mapper) can be specified either as a command to stream
> >> the data through, or as a Java class to use.
> >>
> >> It would make a lot of sense to
> >> (a) allow specifying a Java class that implements each of these
> >> (b) provide meaningful defaults, so that the user of Hadoop Streaming
> >> does not need to worry about details irrelevant for her specific task.
> >> (c) provide a set of useful classes so that the user can pick the
> >> necessary ones rather than re-implementing same things again and
> >> again.
> >> (c.1) make sure that there is a convenient short-hand to specify these
> >> predefined classes (e.g. without long package prefix)
> >>
> >> In particular, it would be good to have predefined Identity mapper and
> >> reducer (the mapper actually is available now), reducers that provide
> >> simple aggregation (like in Abacus), input formats for commonly used
> >> formats (including CSV, flat XML, etc), sorter different from
> >> splitter,
> >> etc.
> >>
> >> Then "Streaming should allow to specify a partitioner" would be
> >> automatically resolved as a special case.
> >> It might be better to implement the whole consistent approach rather
> >> than do special cases one by one.
> >>
> >> -- ab
> >>
> >>
> >> On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:
> >>
> >>> Streaming should allow to specify a partitioner
> >>> -----------------------------------------------
> >>>
> >>>                  Key: HADOOP-1215
> >>>                  URL:
> >>> https://issues.apache.org/jira/browse/HADOOP-1215
> >>>              Project: Hadoop
> >>>           Issue Type: Improvement
> >>>             Reporter: Runping Qi
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> This message is automatically generated by JIRA.
> >>> -
> >>> You can reply to this email to add a comment to the issue online.
> >>>
> >
> >
