[ 
https://issues.apache.org/jira/browse/HADOOP-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen O'Malley updated HADOOP-3341:
----------------------------------

    Status: Open  (was: Patch Available)

This looks good, except that the data fields should be moved down into 
PipeMapper and PipeReducer, respectively. They should also be made private. You 
can set them in the PipeMapper and PipeReducer configure methods. Please also 
include a test for the change.
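
For illustration only, the shape suggested above might look roughly like the 
sketch below. It is not code from the patch: the class name is hypothetical, a 
plain Map stands in for JobConf so the example is self-contained, the property 
name is the one proposed in the issue description, and UTF-8 encoding is an 
assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical stand-in for PipeMapper; a Map replaces JobConf here.
class PipeMapperSketch {
    // Private to the mapper instead of living in a shared base class.
    private byte[] mapInputFieldSeparator;

    // Plays the role of configure(JobConf): read the separator once per task,
    // falling back to TAB when the property is not set.
    void configure(Map<String, String> job) {
        String sep = job.getOrDefault("stream.map.input.field.separator", "\t");
        mapInputFieldSeparator = sep.getBytes(StandardCharsets.UTF_8);
    }

    // Builds the bytes sent to the mapper process: key, separator, value, newline.
    // The configured separator replaces the hard-coded write('\t').
    byte[] formatRecord(String key, String value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        buf.write(k, 0, k.length);
        buf.write(mapInputFieldSeparator, 0, mapInputFieldSeparator.length);
        buf.write(v, 0, v.length);
        buf.write('\n');
        return buf.toByteArray();
    }
}
```

A PipeReducer counterpart would presumably follow the same pattern with its own 
private field and property name.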

> make key-value separators in hadoop streaming fully configurable
> ----------------------------------------------------------------
>
>                 Key: HADOOP-3341
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3341
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>         Attachments: 3341-1.patch
>
>
> By default, hadoop streaming uses TAB as the separator in all places.  
> However, in some environments, users may want to use customized separators 
> (e.g., ^A = \u0001).
> The separator logic in hadoop streaming is very convoluted. Here is a brief 
> summary:
> InputFormat {
>     KeyValueLineRecordReader.java:59:
> S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
> }
> Mapper {
>     PipeMapper.java:88: 
> S2: clientOut_.write('\t');
>     Call mapper process
>     PipeMapRed.java:124:
> S3: String mapOutputFieldSeparator = 
> job_.get("stream.map.output.field.separator", "\t");
>     PipeMapRed.java:128:
>     this.numOfMapOutputKeyFields = 
> job_.getInt("stream.num.map.output.key.fields", 1);
> }
> Reducer {
>     PipeReducer.java:78:
> S4: clientOut_.write('\t');
>     Call reducer process
>     PipeMapRed.java:125:
> S5: String reduceOutputFieldSeparator = 
> job_.get("stream.reduce.output.field.separator", "\t");
>     PipeMapRed.java:129:
>     this.numOfReduceOutputKeyFields = 
> job_.getInt("stream.num.reduce.output.key.fields", 1);
> }
> OutputFormat {
>     TextOutputFormat.java:112:
> S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator", 
> "\t");
> }
> Short-cuts: 
> 1. If we use "TextInputFormat", S1 and S2 are not used at all. Lines are 
> fed directly into the mapper (through the value part of the key-value pair; 
> the keys, which are offsets, are simply ignored).
> 2. For jobs with no reducers, the "Reducer" step is skipped.
> We need to make S2 and S4 (the two hard-coded TAB writes) configurable, 
> possibly under the following names for conformity:
> stream.map.input.field.separator
> stream.reduce.input.field.separator
> Then, by specifying: -jobconf key.value.separator.in.input.line=^A -jobconf 
> stream.map.input.field.separator=^A -jobconf 
> stream.map.output.field.separator=^A -jobconf 
> stream.reduce.input.field.separator=^A -jobconf 
> stream.reduce.output.field.separator=^A -jobconf 
> mapred.textoutputformat.separator=^A, we will be able to use ^A instead of 
> TAB in every place!
> Maybe hadoop streaming can also provide a single option to override these 6 
> options.
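
One way the single-option idea in the description could work is sketched 
below. This is only an assumption about a possible design: the umbrella key 
"stream.all.field.separator" is made up for illustration and is not an existing 
Hadoop option, and a plain Map stands in for JobConf. Explicitly set properties 
win over the umbrella value.

```java
import java.util.HashMap;
import java.util.Map;

class SeparatorDefaults {
    // Hypothetical umbrella key; not an existing Hadoop option.
    static final String UMBRELLA_KEY = "stream.all.field.separator";

    // The six separator properties named in the issue description.
    static final String[] SEPARATOR_KEYS = {
        "key.value.separator.in.input.line",
        "stream.map.input.field.separator",
        "stream.map.output.field.separator",
        "stream.reduce.input.field.separator",
        "stream.reduce.output.field.separator",
        "mapred.textoutputformat.separator",
    };

    // Returns a copy of the job settings in which every separator property
    // that was not set explicitly takes the umbrella value, if one is given.
    static Map<String, String> expand(Map<String, String> job) {
        Map<String, String> result = new HashMap<>(job);
        String all = job.get(UMBRELLA_KEY);
        if (all != null) {
            for (String key : SEPARATOR_KEYS) {
                result.putIfAbsent(key, all);
            }
        }
        return result;
    }
}
```

With this shape, a job that sets only the umbrella key to ^A would get ^A for 
all six properties, while a job that also sets one of them explicitly keeps 
that explicit value.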

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
