[
https://issues.apache.org/jira/browse/HADOOP-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492248
]
Hadoop QA commented on HADOOP-1284:
-----------------------------------
Integrated in Hadoop-Nightly #71 (See
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/71/)
> clean up the protocol between stream mapper/reducer and the framework
> ---------------------------------------------------------------------
>
> Key: HADOOP-1284
> URL: https://issues.apache.org/jira/browse/HADOOP-1284
> Project: Hadoop
> Issue Type: Improvement
> Reporter: Runping Qi
> Assigned To: Runping Qi
> Fix For: 0.13.0
>
> Attachments: patch-1284.txt
>
>
> Right now, the protocol between the stream mapper/reducer and the framework
> is very inflexible.
> The mapper/reducer generates line-oriented output. The framework reads that
> output line by line and splits each line into a key/value pair. By default,
> the substring up to the first tab character is the key, and the substring
> after the first tab character is the value.
> However, in many cases, the application wants some control over how the pair
> is split.
> Here, I'd like to introduce the following configuration variables for that:
> 1. "streaming.output.field.separator": the default value is the tab
> character, but the user can specify a different separator (e.g. ':' or ', ').
> A map output line can then be treated as a list of fields separated by the
> separator.
> 2. "streaming.num.fields.for.mapout.key": the number of leading fields that
> will be used as the map output key (and for sorting on the reduce side).
> The default value is 1.
> The rest of the fields will be used as the value. For example, I can specify
> the first 5 fields as my mapout key.
> 3. "streaming.num.fields.for.partitioning": Sometimes, I want to use fewer
> fields for partitioning than for the key, to achieve the "primary/secondary"
> composite key effect proposed in HADOOP-485. The default value is 1.
> For example, I can set "streaming.num.fields.for.partitioning" to 3
> and "streaming.num.fields.for.mapout.key" to 5.
> This effectively says that fields 4 and 5 are my secondary key.
> With the default values above, the proposal is compatible with the current
> behavior while introducing a desirable new feature in a clean way.
> Thoughts?
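
The splitting and partitioning rules proposed in the description can be
sketched roughly as follows. This is only an illustration in Python, not the
code from patch-1284.txt: the function names are hypothetical, the treatment
of lines with too few fields (whole line becomes the key, empty value) is an
assumption, and CRC32 merely stands in for whatever hash the framework's
partitioner actually uses.

```python
import zlib


def split_key_value(line, separator="\t", num_key_fields=1):
    """Split one output line into (key, value).

    The key is the first `num_key_fields` fields; the value is the rest.
    Assumption: a line with too few fields becomes the key, with an empty value.
    """
    line = line.rstrip("\n")
    # Split at most num_key_fields times so the value keeps its own separators.
    fields = line.split(separator, num_key_fields)
    if len(fields) <= num_key_fields:
        return line, ""
    return separator.join(fields[:num_key_fields]), fields[num_key_fields]


def partition_for(key, separator, num_partition_fields, num_reducers):
    """Pick a reducer from the first `num_partition_fields` fields of the key.

    CRC32 is a stand-in hash chosen for this sketch only.
    """
    prefix = separator.join(key.split(separator)[:num_partition_fields])
    return zlib.crc32(prefix.encode("utf-8")) % num_reducers


# Default behavior: tab separator, one key field.
default_key, default_value = split_key_value("k1\tv1\tv2\n")

# The example from the description: 5 key fields, 3 partitioning fields.
key_a, value_a = split_key_value("a:b:c:d:e:rest:more", ":", 5)
key_b, value_b = split_key_value("a:b:c:x:y:other", ":", 5)
same_reducer = partition_for(key_a, ":", 3, 10) == partition_for(key_b, ":", 3, 10)
```

In this sketch, the two keys share their first three fields, so both records
go to the same reducer, while fields 4 and 5 remain part of the key and can
act as the secondary sort key.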
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.