make key-value separators in hadoop streaming fully configurable
----------------------------------------------------------------

                 Key: HADOOP-3341
                 URL: https://issues.apache.org/jira/browse/HADOOP-3341
             Project: Hadoop Core
          Issue Type: Improvement
          Components: contrib/streaming
            Reporter: Zheng Shao


By default, hadoop streaming uses TAB as the separator in all places.  However,
in some environments users may want to use custom separators (e.g., ^A =
\u0001).

The separator logic in hadoop streaming is very convoluted. Here is a brief 
summary:

InputFormat {
    KeyValueLineRecordReader.java:59:
S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
}

Mapper {
    PipeMapper.java:88: 
S2: clientOut_.write('\t');

    Call mapper process

    PipeMapRed.java:124:
S3: String mapOutputFieldSeparator =
        job_.get("stream.map.output.field.separator", "\t");
    PipeMapRed.java:128:
    this.numOfMapOutputKeyFields =
        job_.getInt("stream.num.map.output.key.fields", 1);
}


Reducer {
    PipeReducer.java:78:
S4: clientOut_.write('\t');

    Call reducer process

    PipeMapRed.java:125:
S5: String reduceOutputFieldSeparator =
        job_.get("stream.reduce.output.field.separator", "\t");
    PipeMapRed.java:129:
    this.numOfReduceOutputKeyFields =
        job_.getInt("stream.num.reduce.output.key.fields", 1);
}

OutputFormat {
    TextOutputFormat.java:112:
S6: String keyValueSeparator =
        job.get("mapred.textoutputformat.separator", "\t");
}

Short-cuts:
1. When "TextInputFormat" is used, S1 and S2 are not used at all. Lines are
fed directly into the mapper (through the value part of the key-value pair;
the keys, which are offsets, are simply ignored).
2. For jobs with no reducers, the "Reducer" step is skipped.
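
To make the role of S3/S5 and the "num ... key.fields" settings concrete, here
is a minimal sketch of how one output line might be split into key and value.
It reflects my understanding of the streaming behaviour (the key is everything
before the N-th separator; a line with fewer than N separators becomes the key
with an empty value) and is not the actual PipeMapRed code. The class and
method names are hypothetical.

```java
final class KeyFieldSplitSketch {
    // Returns {key, value}. The line is cut at the numKeyFields-th
    // occurrence of sep; the separator at the cut is dropped.
    // Assumption: with fewer than numKeyFields separators, the whole
    // line is treated as the key and the value is empty.
    static String[] split(String line, String sep, int numKeyFields) {
        if (numKeyFields < 1 || sep.isEmpty()) {
            throw new IllegalArgumentException("need numKeyFields >= 1 and a non-empty separator");
        }
        int pos = -1;
        int from = 0;
        for (int i = 0; i < numKeyFields; i++) {
            pos = line.indexOf(sep, from);
            if (pos < 0) {
                return new String[] { line, "" };
            }
            from = pos + sep.length();
        }
        return new String[] { line.substring(0, pos), line.substring(from) };
    }

    public static void main(String[] args) {
        // Two key fields with "." as the separator.
        String[] kv = split("a.b.c", ".", 2);
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```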


We need to make S2 and S4 configurable (S3 is already configurable; S2 and S4
are the hard-coded '\t' writes), possibly under the following names for
conformity:
stream.map.input.field.separator
stream.reduce.input.field.separator
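
As a rough illustration of what this could look like, the hard-coded '\t'
writes at S2 and S4 might instead look the separator up by name, falling back
to TAB. The sketch below is not a patch: java.util.Properties stands in for
the job configuration, the class and method names are hypothetical, and the
key/separator/value/newline record layout is an assumption about what the
child process receives.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class InputSeparatorSketch {
    // Look up a separator by property name; TAB remains the default.
    static byte[] separatorBytes(Properties conf, String name) {
        return conf.getProperty(name, "\t").getBytes(StandardCharsets.UTF_8);
    }

    // Build the bytes for one record: key, separator, value, newline.
    static byte[] encodeRecord(byte[] key, byte[] sep, byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(key, 0, key.length);
        out.write(sep, 0, sep.length);   // previously a fixed '\t'
        out.write(value, 0, value.length);
        out.write('\n');
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Proposed property name from this issue, set to ^A.
        conf.setProperty("stream.map.input.field.separator", "\u0001");
        byte[] sep = separatorBytes(conf, "stream.map.input.field.separator");
        byte[] rec = encodeRecord("k".getBytes(StandardCharsets.UTF_8), sep,
                                  "v".getBytes(StandardCharsets.UTF_8));
        System.out.println("record length in bytes: " + rec.length);
    }
}
```

The reducer side (S4) would do the same lookup with
stream.reduce.input.field.separator.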


Then, by specifying: -jobconf key.value.separator.in.input.line=^A -jobconf
stream.map.input.field.separator=^A -jobconf
stream.map.output.field.separator=^A -jobconf
stream.reduce.input.field.separator=^A -jobconf
stream.reduce.output.field.separator=^A -jobconf
mapred.textoutputformat.separator=^A, we will be able to use ^A instead of TAB
everywhere!

Maybe hadoop streaming can also provide a single option to override these 6 
options.
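
One possible shape for such a single option is a helper that expands one
separator value into all six properties and emits the matching -jobconf
arguments. This is only a sketch: the class and method names are hypothetical,
and the two "input.field.separator" properties are the names proposed in this
issue, not existing options.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SeparatorOptionsSketch {
    // The six separator properties, in pipeline order (S1..S6).
    static final String[] NAMES = {
        "key.value.separator.in.input.line",
        "stream.map.input.field.separator",     // proposed in this issue
        "stream.map.output.field.separator",
        "stream.reduce.input.field.separator",  // proposed in this issue
        "stream.reduce.output.field.separator",
        "mapred.textoutputformat.separator",
    };

    // Map every separator property to the same value.
    static Map<String, String> expand(String sep) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String name : NAMES) {
            props.put(name, sep);
        }
        return props;
    }

    // Turn the properties into "-jobconf name=value" argument pairs.
    static List<String> toJobconfArgs(Map<String, String> props) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            args.add("-jobconf");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        List<String> jobconf = toJobconfArgs(expand("\u0001"));
        System.out.println(jobconf.size() + " arguments generated");
    }
}
```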

