To extend this,
I'd suggest the following way of thinking about the Hadoop Streaming
interface: a map-reduce job is parameterized by several pluggable algorithms.
This includes at least
1. mapper
2. reducer (including special case of NONE)
3. input format
4. input partitioner
5. splitter for reduce
6. sorter for reduce
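The six slots above can be viewed as one pipeline. A minimal sketch in plain Java of how such a parameterization could look (no Hadoop dependency; all names here are illustrative stand-ins, not Hadoop's real interfaces; input format and the reduce-side splitter are elided for brevity):

```java
import java.util.*;
import java.util.function.*;

// Illustrative stand-ins for the pluggable pieces of a streaming job.
public class StreamingPipeline {
    // 1. mapper:      one input line -> zero or more {key, value} records
    // 4. partitioner: key -> partition index
    // 6. sorter:      ordering of keys within a partition
    // 2. reducer:     key + all its values -> one output line
    public static List<String> run(List<String> input,
                                   Function<String, List<String[]>> mapper,
                                   BiFunction<String, Integer, Integer> partitioner,
                                   Comparator<String> sorter,
                                   BiFunction<String, List<String>, String> reducer,
                                   int numPartitions) {
        // Map phase: shuffle each emitted record into a partition by key.
        List<Map<String, List<String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new HashMap<>());
        for (String line : input) {
            for (String[] kv : mapper.apply(line)) {
                int p = partitioner.apply(kv[0], numPartitions);
                partitions.get(p)
                          .computeIfAbsent(kv[0], k -> new ArrayList<>())
                          .add(kv[1]);
            }
        }
        // Reduce phase: sort keys within each partition, reduce each group.
        List<String> output = new ArrayList<>();
        for (Map<String, List<String>> part : partitions) {
            List<String> keys = new ArrayList<>(part.keySet());
            keys.sort(sorter);
            for (String k : keys) output.add(reducer.apply(k, part.get(k)));
        }
        return output;
    }
}
```

With this shape, each flag on the streaming command line would simply select an implementation for one of the slots, with sensible defaults for the rest.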
The current Hadoop Streaming allows specifying only 1 and 2 (and
gives only limited control over 3).
Nicely, 1 (the mapper) can be specified either as a command to stream the
data through or as a Java class to use.
It would make a lot of sense to
(a) allow specifying a Java class that implements each of these;
(b) provide meaningful defaults, so that the user of Hadoop Streaming
does not need to worry about details irrelevant to her specific task;
(c) provide a set of useful classes so that the user can pick the
necessary ones rather than re-implementing the same things again and again;
(c.1) make sure that there is a convenient short-hand for specifying these
predefined classes (e.g. without a long package prefix).
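The short-hand in (c.1) could be handled by resolving bare class names against a list of default packages before falling back to the name as given. A sketch (the package names below are only for illustration, not Hadoop's actual layout):

```java
// Illustrative resolver for short-hand class names.
public class ClassNameResolver {
    // Packages searched, in order, when the user gives a bare class name.
    // These package names are made up for the example.
    private static final String[] DEFAULT_PACKAGES = {
        "org.apache.hadoop.streaming.lib",
        "org.apache.hadoop.mapred.lib",
    };

    // Returns the first fully qualified name that resolves to a real class,
    // or the input unchanged if it is already qualified or nothing matches.
    public static String resolve(String shortName) {
        if (shortName.contains(".")) return shortName;  // already qualified
        for (String pkg : DEFAULT_PACKAGES) {
            String candidate = pkg + "." + shortName;
            try {
                Class.forName(candidate);
                return candidate;
            } catch (ClassNotFoundException ignored) { /* try next package */ }
        }
        return shortName;
    }
}
```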
In particular, it would be good to have a predefined identity mapper and
reducer (the mapper is actually available now), reducers that provide
simple aggregation (as in Abacus), input formats for commonly used file
formats (including CSV, flat XML, etc.), a sorter distinct from the
splitter, and so on.
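As one concrete example of such a predefined aggregation reducer, a streaming-style long-sum reducer could consume sorted "key\tvalue" lines (as the reduce side would deliver them) and emit one total per key. A sketch; the class name is made up here:

```java
import java.util.*;

// Illustrative Abacus-style aggregation reducer: sums values per key.
public class LongSumReducer {
    // Input: "key\tvalue" lines already sorted by key.
    // Output: one "key\ttotal" line per distinct key, in key order.
    public static List<String> reduce(List<String> sortedLines) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        for (String line : sortedLines) {
            String[] kv = line.split("\t", 2);
            if (currentKey != null && !currentKey.equals(kv[0])) {
                out.add(currentKey + "\t" + sum);  // key group finished
                sum = 0;
            }
            currentKey = kv[0];
            sum += Long.parseLong(kv[1]);
        }
        if (currentKey != null) out.add(currentKey + "\t" + sum);
        return out;
    }
}
```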
Then "Streaming should allow to specify a partitioner" would
automatically be resolved as a special case.
It might be better to implement the whole consistent approach rather
than handle special cases one by one.
-- ab
On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:
Streaming should allow to specify a partitioner
-----------------------------------------------
Key: HADOOP-1215
URL: https://issues.apache.org/jira/browse/HADOOP-1215
Project: Hadoop
Issue Type: Improvement
Reporter: Runping Qi
--
This message is automatically generated by JIRA.