To extend this,
I'd suggest that Hadoop Streaming be interfaced in the following way:

The map-reduce process is parameterized by several algorithms.
These include at least:
1. mapper
2. reducer  (including special case of NONE)
3. input format
4. input partitioner
5. splitter for reduce
6. sorter for reduce
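To make the plug-in points concrete, here is a toy in-memory sketch of a map-reduce pipeline where each of these pieces is an ordinary function the caller can replace independently. This is purely illustrative; the names and signatures are mine, not Hadoop's actual API.

```python
# Toy illustration of the pluggable stages listed above -- NOT Hadoop's
# actual interfaces. Assumes the input format has already produced
# (key, value) records.
from itertools import groupby

def run_job(records, mapper, reducer, partitioner, sorter, num_reducers=2):
    # map phase: each input record may yield any number of (key, value) pairs
    mapped = [kv for k, v in records for kv in mapper(k, v)]
    # partition: decide which reducer instance receives each key
    partitions = [[] for _ in range(num_reducers)]
    for k, v in mapped:
        partitions[partitioner(k, num_reducers)].append((k, v))
    # per partition: sort, split into key groups, and reduce each group
    out = []
    for part in partitions:
        part.sort(key=sorter)
        for k, group in groupby(part, key=lambda kv: kv[0]):
            out.extend(reducer(k, [v for _, v in group]))
    return out

# usage: word count, with every stage supplied explicitly
counts = run_job(
    [(None, "a b a"), (None, "b")],
    mapper=lambda _, line: ((w, 1) for w in line.split()),
    reducer=lambda k, vs: [(k, sum(vs))],
    partitioner=lambda k, n: hash(k) % n,
    sorter=lambda kv: kv[0],
)
```

The point is that the mapper, reducer, partitioner, and sorter are all independent parameters of the same job; Streaming today only exposes the first two.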

The current Hadoop Streaming allows the user to specify only 1 and 2 (and gives limited control over 3). Nicely, 1 (the mapper) can be specified either as a command to stream the data through or as a Java class to use.

It would make a lot of sense to
(a) allow the user to specify a Java class that implements each of these;
(b) provide meaningful defaults, so that the user of Hadoop Streaming does not need to worry about details irrelevant to her specific task;
(c) provide a set of useful classes so that the user can pick the necessary ones rather than re-implementing the same things again and again;
(c.1) make sure that there is a convenient shorthand to specify these predefined classes (e.g. without a long package prefix).

In particular, it would be good to have predefined Identity mapper and reducer (the mapper is actually available now), reducers that provide simple aggregation (like in Abacus), input formats for commonly used formats (including CSV, flat XML, etc.), a sorter different from the splitter, and so on.
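An Abacus-style aggregation reducer, for instance, is trivial to write once and reuse. A Python sketch of the idea (the name is illustrative, not an actual Hadoop class):

```python
# Sketch of a predefined "sum" reducer: for each key, emit the sum of
# its numeric values. With such a class available out of the box, a
# user counting events per key only writes a mapper emitting (key, 1).
def long_value_sum_reducer(key, values):
    yield key, sum(int(v) for v in values)

result = list(long_value_sum_reducer("clicks", ["1", "2", "3"]))
```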

Then "Streaming should allow to specify a partitioner" would be automatically resolved as a special case. It might be better to implement a whole consistent approach rather than do special cases one by one.
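Indeed, the partitioner case reduces to swapping in one function. Conceptually (a Python sketch of the idea, not Hadoop's actual Partitioner interface):

```python
# The default policy: route each key to one of num_reducers partitions
# by hash. A custom partitioner simply replaces this function.
def hash_partitioner(key, num_reducers):
    return hash(key) % num_reducers

# Example custom policy: partition on the first field of a
# tab-separated key, so all records for the same user reach the same
# reducer regardless of the rest of the key.
def prefix_partitioner(key, num_reducers):
    return hash(key.split("\t")[0]) % num_reducers
```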

-- ab


On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:

Streaming should allow to specify a partitioner
-----------------------------------------------

                 Key: HADOOP-1215
                 URL: https://issues.apache.org/jira/browse/HADOOP-1215
             Project: Hadoop
          Issue Type: Improvement
            Reporter: Runping Qi




--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

