To extend this,
I'd suggest the following way of thinking about the Hadoop Streaming
interface: a map-reduce job is parameterized by several pluggable algorithms.
This includes at least
1. mapper
2. reducer (including special case of NONE)
3. input format
4. input partitioner
5. splitter for reduce
6. sorter for reduce
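The six slots above can be viewed as one pipeline. A minimal sketch in plain Java of how such a parameterization could look (no Hadoop dependency; all names here are illustrative stand-ins, not Hadoop's real interfaces; input format and the reduce-side splitter are elided for brevity):

```java
import java.util.*;
import java.util.function.*;

// Illustrative stand-ins for the pluggable pieces of a streaming job.
public class StreamingPipeline {
    // 1. mapper:      one input line -> zero or more {key, value} records
    // 4. partitioner: key -> partition index
    // 6. sorter:      ordering of keys within a partition
    // 2. reducer:     key + all its values -> one output line
    public static List<String> run(List<String> input,
                                   Function<String, List<String[]>> mapper,
                                   BiFunction<String, Integer, Integer> partitioner,
                                   Comparator<String> sorter,
                                   BiFunction<String, List<String>, String> reducer,
                                   int numPartitions) {
        // Map phase: shuffle each emitted record into a partition by key.
        List<Map<String, List<String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new HashMap<>());
        for (String line : input) {
            for (String[] kv : mapper.apply(line)) {
                int p = partitioner.apply(kv[0], numPartitions);
                partitions.get(p)
                          .computeIfAbsent(kv[0], k -> new ArrayList<>())
                          .add(kv[1]);
            }
        }
        // Reduce phase: sort keys within each partition, reduce each group.
        List<String> output = new ArrayList<>();
        for (Map<String, List<String>> part : partitions) {
            List<String> keys = new ArrayList<>(part.keySet());
            keys.sort(sorter);
            for (String k : keys) output.add(reducer.apply(k, part.get(k)));
        }
        return output;
    }
}
```

With this shape, each flag on the streaming command line would simply select an implementation for one of the slots, with sensible defaults for the rest.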
The current Hadoop Streaming allows specifying only 1 and 2 (and
gives only limited control over 3).
Nicely, 1 (the mapper) can be specified either as a command to stream the
data through or as a Java class to use.
It would make a lot of sense to
(a) allow specifying a Java class that implements each of these;
(b) provide meaningful defaults, so that the user of Hadoop Streaming
does not need to worry about details irrelevant to her specific task;
(c) provide a set of useful classes so that the user can pick the
necessary ones rather than re-implementing the same things again and again;
(c.1) make sure that there is a convenient short-hand for specifying these
predefined classes (e.g. without a long package prefix).
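The short-hand in (c.1) could be handled by resolving bare class names against a list of default packages before falling back to the name as given. A sketch (the package names below are only for illustration, not Hadoop's actual layout):

```java
// Illustrative resolver for short-hand class names.
public class ClassNameResolver {
    // Packages searched, in order, when the user gives a bare class name.
    // These package names are made up for the example.
    private static final String[] DEFAULT_PACKAGES = {
        "org.apache.hadoop.streaming.lib",
        "org.apache.hadoop.mapred.lib",
    };

    // Returns the first fully qualified name that resolves to a real class,
    // or the input unchanged if it is already qualified or nothing matches.
    public static String resolve(String shortName) {
        if (shortName.contains(".")) return shortName;  // already qualified
        for (String pkg : DEFAULT_PACKAGES) {
            String candidate = pkg + "." + shortName;
            try {
                Class.forName(candidate);
                return candidate;
            } catch (ClassNotFoundException ignored) { /* try next package */ }
        }
        return shortName;
    }
}
```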
In particular, it would be good to have a predefined identity mapper and
reducer (the mapper is actually available now), reducers that provide
simple aggregation (as in Abacus), input formats for commonly used file
formats (including CSV, flat XML, etc.), a sorter distinct from the
splitter, and so on.
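As one concrete example of such a predefined aggregation reducer, a streaming-style long-sum reducer could consume sorted "key\tvalue" lines (as the reduce side would deliver them) and emit one total per key. A sketch; the class name is made up here:

```java
import java.util.*;

// Illustrative Abacus-style aggregation reducer: sums values per key.
public class LongSumReducer {
    // Input: "key\tvalue" lines already sorted by key.
    // Output: one "key\ttotal" line per distinct key, in key order.
    public static List<String> reduce(List<String> sortedLines) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        for (String line : sortedLines) {
            String[] kv = line.split("\t", 2);
            if (currentKey != null && !currentKey.equals(kv[0])) {
                out.add(currentKey + "\t" + sum);  // key group finished
                sum = 0;
            }
            currentKey = kv[0];
            sum += Long.parseLong(kv[1]);
        }
        if (currentKey != null) out.add(currentKey + "\t" + sum);
        return out;
    }
}
```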
Then "Streaming should allow to specify a partitioner" would
automatically be resolved as a special case.
It might be better to implement the whole consistent approach rather
than handle special cases one by one.
-- ab
On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:
Streaming should allow to specify a partitioner
-----------------------------------------------
Key: HADOOP-1215
URL: https://issues.apache.org/jira/browse/HADOOP-1215
Project: Hadoop
Issue Type: Improvement
Reporter: Runping Qi
--
This message is automatically generated by JIRA.