This looks very good.
Probably my "splitter for reduce" is what you call "partitioner" -- a
function that, given a key K and the number of partitions N, produces
an index I in 0..N-1.
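A minimal sketch of that contract, in Python for brevity. The function
name `partition` and the choice of CRC32 as the hash are assumptions
made for illustration; this is not Hadoop's Partitioner API.

```python
import zlib


def partition(key: str, num_partitions: int) -> int:
    """Map a key K to a partition index I in 0..N-1."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # crc32 is stable across processes, unlike Python's built-in hash()
    # for strings, so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

The only property that matters here is that equal keys always map to
the same index, so all records for one key reach the same reduce task.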
"sorter for reduce" -- the function that defines how records with the
same key are ordered when presented to reduce.
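To illustrate what is meant by a sorter that is separate from the
splitter, here is a rough Python sketch. The names `reduce_input` and
`value_order` are hypothetical, and the in-memory sort only stands in
for whatever the framework would actually do.

```python
from itertools import groupby
from operator import itemgetter


def reduce_input(records, value_order=None):
    """Yield (key, values) groups; values of one key are ordered by value_order."""
    # Group by key first; sorted() is stable, so without a value_order
    # the values of a key keep their arrival order.
    by_key = sorted(records, key=itemgetter(0))
    for key, group in groupby(by_key, key=itemgetter(0)):
        values = [value for _, value in group]
        if value_order is not None:
            # The "sorter for reduce": orders records that share a key.
            values.sort(key=value_order)
        yield key, values
```

For example, passing `value_order=int` would present the values of each
key to reduce in numeric order, independently of how keys are split
across reduce tasks.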
It may be useful to be able to specify how the Map input is
partitioned. It is part of InputFormat. However, from a naive user's
perspective, specifying how to read records and find record boundaries
is very different from specifying how to partition the input. (I
agree this is not a high-priority issue, as long as I can specify the
number of map tasks I'd like to have.)
Should the list of specifiable parameters also include the Combiner class?
And once again, it would be great if the Abacus classes were available in
the reworked Streaming through exactly the same mechanism, without
additional conventions.
E.g. I'd like to have tab-separated <key, value> pairs as input,
IdentityMapper, and the Abacus class that gives me the sum, the count,
and the standard deviation of the values for each key.
(This is https://issues.apache.org/jira/browse/HADOOP-1247)
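To make the request concrete, here is a sketch in Python of the
aggregation described above, written as a plain function over
tab-separated lines. The name `aggregate` is hypothetical, and the use
of the population standard deviation is an assumption; this is not the
Abacus implementation.

```python
import math
from collections import defaultdict


def aggregate(lines):
    """Return {key: (sum, count, std)} for "key<TAB>value" lines."""
    # Per key: running sum, count, and sum of squares.
    stats = defaultdict(lambda: [0.0, 0, 0.0])
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        x = float(value)
        entry = stats[key]
        entry[0] += x
        entry[1] += 1
        entry[2] += x * x
    result = {}
    for key, (total, count, squares) in stats.items():
        mean = total / count
        # Population standard deviation (an assumption); max() guards
        # against a tiny negative value caused by rounding.
        std = math.sqrt(max(squares / count - mean * mean, 0.0))
        result[key] = (total, count, std)
    return result
```

With an identity mapper in front, a reducer built on this would only
need to feed its standard input to `aggregate` and print one line per
key.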
-- ab
On Apr 10, 2007, at 1:58 PM, Runping Qi wrote:
Hi Arkady,
With my changes, which should be available soon, the user can specify
the following:
1. Mapper (a Java mapper class or an executable)
2. Reducer (a Java reducer class or an executable). Reduce NONE will be
introduced as per HADOOP-1216.
3. InputFormat class
4. OutputFormat class
5. Partitioner
I don't understand what you mean by (input partitioner, splitter for
reduce, sorter for reduce). Can you explain?
Hadoop has a collection of built-in classes:
IdentityMapper, IdentityReducer, RegexMapper, TokenCountMapper,
LongSumReducer
TextInputFormat, SequenceFileInputFormat, TextOutputFormat,
SequenceFileOutputFormat, NullOutputFormat
Some more coming soon:
SequenceFileToLineInputFormat, KeyValueTextInputFormat.
We can add IdentityMapper/IdentityReducer/
KeyValueTextInputFormat/TextOutputFormat as the defaults for Hadoop
Streaming.
Runping
-----Original Message-----
From: Arkady Borkovsky [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 10, 2007 1:24 PM
To: [email protected]
Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to
specify a partitioner
To extend this,
I'd suggest that Hadoop Streaming be interfaced in the following way:
A MapReduce process is parameterized by several algorithms.
These include at least:
1. mapper
2. reducer (including special case of NONE)
3. input format
4. input partitioner
5. splitter for reduce
6. sorter for reduce
The current Hadoop Streaming allows specifying only 1 and 2 (and
gives limited control over 3).
Nicely, 1 (the mapper) can be specified either as a command to stream
the data through or as a Java class to use.
It would make a lot of sense to
(a) allow the user to specify a Java class that implements each of these
(b) provide meaningful defaults, so that the user of Hadoop Streaming
does not need to worry about details irrelevant to her specific task.
(c) provide a set of useful classes so that the user can pick the
necessary ones rather than re-implementing the same things again and
again.
(c.1) make sure that there is a convenient shorthand to specify these
predefined classes (e.g. without the long package prefix)
In particular, it would be good to have a predefined Identity mapper and
reducer (the mapper actually is available now), reducers that provide
simple aggregation (like in Abacus), input formats for commonly used
formats (including CSV, flat XML, etc.), a sorter different from the
splitter, etc.
Then "Streaming should allow to specify a partitioner" would be
automatically resolved as a special case.
It might be better to implement the whole consistent approach rather
than do special cases one by one.
-- ab
On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:
Streaming should allow to specify a partitioner
-----------------------------------------------
Key: HADOOP-1215
URL:
https://issues.apache.org/jira/browse/HADOOP-1215
Project: Hadoop
Issue Type: Improvement
Reporter: Runping Qi