make key-value separators in hadoop streaming fully configurable
----------------------------------------------------------------
Key: HADOOP-3341
URL: https://issues.apache.org/jira/browse/HADOOP-3341
Project: Hadoop Core
Issue Type: Improvement
Components: contrib/streaming
Reporter: Zheng Shao
By default, hadoop streaming uses TAB as the separator in all places. However,
in some environments, users may want to use custom separators (e.g., ^A =
\u0001).
The separator logic in hadoop streaming is very convoluted. Here is a brief
summary:
InputFormat {
  KeyValueLineRecordReader.java:59:
    S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
}
Mapper {
  PipeMapper.java:88:
    S2: clientOut_.write('\t');
  Call mapper process
  PipeMapRed.java:124:
    S3: String mapOutputFieldSeparator =
          job_.get("stream.map.output.field.separator", "\t");
  PipeMapRed.java:128:
    this.numOfMapOutputKeyFields =
      job_.getInt("stream.num.map.output.key.fields", 1);
}
Reducer {
  PipeReducer.java:78:
    S4: clientOut_.write('\t');
  Call reducer process
  PipeMapRed.java:125:
    S5: String reduceOutputFieldSeparator =
          job_.get("stream.reduce.output.field.separator", "\t");
  PipeMapRed.java:129:
    this.numOfReduceOutputKeyFields =
      job_.getInt("stream.num.reduce.output.key.fields", 1);
}
OutputFormat {
  TextOutputFormat.java:112:
    S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator",
          "\t");
}
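To make the role of S3/S5 and the two "num ... key.fields" settings concrete,
here is a minimal sketch of splitting one line of process output into a key
and a value at the N-th separator. It is not a copy of the PipeMapRed code:
the class name, the method name, and the fallback for lines with too few
separators (whole line becomes the key) are assumptions made for illustration.

```java
import java.util.Arrays;

final class KeyValueSplitSketch {

    /**
     * Splits a line into {key, value}. The key spans the first
     * numKeyFields fields; the value is everything after the
     * numKeyFields-th separator.
     */
    static String[] split(String line, String separator, int numKeyFields) {
        if (numKeyFields < 1) {
            throw new IllegalArgumentException("numKeyFields must be >= 1");
        }
        int pos = -1;
        int from = 0;
        for (int i = 0; i < numKeyFields; i++) {
            pos = line.indexOf(separator, from);
            if (pos < 0) {
                // Assumption: too few separators means the whole line is the key.
                return new String[] { line, "" };
            }
            from = pos + separator.length();
        }
        return new String[] { line.substring(0, pos), line.substring(from) };
    }

    public static void main(String[] args) {
        // Separator "," with two key fields: key is "k1,k2", value is "v".
        System.out.println(Arrays.toString(split("k1,k2,v", ",", 2)));
    }
}
```

With separator ^A and a key-field count of 2, a line "k1^Ak2^Av" would be
split the same way, into key "k1^Ak2" and value "v".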
Short-cuts:
1. When "TextInputFormat" is used, S1 and S2 are not used at all. Lines are
fed directly into the mapper (through the value part of the key-value pair;
the keys, which are offsets, are ignored).
2. For jobs with no reducers, the "Reducer" step is skipped.
We need to make S2 and S4 configurable (they are the only two that are
hard-coded today), possibly under the following names for conformity:
stream.map.input.field.separator
stream.reduce.input.field.separator
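A possible shape for the change at S2 and S4 is sketched below: look the
separator up once, default to TAB, and write it between key and value instead
of the hard-coded '\t'. This is only a sketch under assumptions: a plain
Map<String, String> stands in for the job configuration, and the class and
method names are hypothetical rather than taken from PipeMapper/PipeReducer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class InputSeparatorSketch {
    // Property names proposed in this issue.
    static final String MAP_INPUT_SEP = "stream.map.input.field.separator";
    static final String REDUCE_INPUT_SEP = "stream.reduce.input.field.separator";

    /** Reads the separator from the (stand-in) configuration, defaulting to TAB. */
    static byte[] separatorBytes(Map<String, String> conf, String property) {
        return conf.getOrDefault(property, "\t").getBytes(StandardCharsets.UTF_8);
    }

    /** Writes key, separator, value, newline to the client process stream. */
    static void writeRecord(OutputStream clientOut, byte[] separator,
                            String key, String value) throws IOException {
        clientOut.write(key.getBytes(StandardCharsets.UTF_8));
        clientOut.write(separator); // replaces clientOut_.write('\t')
        clientOut.write(value.getBytes(StandardCharsets.UTF_8));
        clientOut.write('\n');
    }

    /** Convenience helper for the demo: renders one record to a String. */
    static String render(Map<String, String> conf, String property,
                         String key, String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            writeRecord(out, separatorBytes(conf, property), key, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(MAP_INPUT_SEP, "|");
        // Mapper input uses the configured "|"; reducer input falls back to TAB.
        System.out.print(render(conf, MAP_INPUT_SEP, "key", "value"));
        System.out.print(render(conf, REDUCE_INPUT_SEP, "key", "value"));
    }
}
```

Resolving the separator to bytes once, rather than per record, keeps the
per-record write path as cheap as the current single-byte write.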
Then, by specifying:
  -jobconf key.value.separator.in.input.line=^A
  -jobconf stream.map.input.field.separator=^A
  -jobconf stream.map.output.field.separator=^A
  -jobconf stream.reduce.input.field.separator=^A
  -jobconf stream.reduce.output.field.separator=^A
  -jobconf mapred.textoutputformat.separator=^A
we will be able to use ^A instead of TAB in every place!
Maybe hadoop streaming can also provide a single option to override these 6
options.
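One way the single option could work is to expand it into the six individual
properties at job setup, letting any explicitly set property win. The sketch
below assumes a hypothetical option name, "stream.all.field.separator", which
is not an existing Hadoop property, and again uses a plain Map as a stand-in
for the job configuration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SeparatorOverrideSketch {
    // Hypothetical name for the single override option.
    static final String GLOBAL = "stream.all.field.separator";

    // The six separator properties, in pipeline order S1..S6.
    static final List<String> SEPARATOR_PROPERTIES = Arrays.asList(
        "key.value.separator.in.input.line",
        "stream.map.input.field.separator",
        "stream.map.output.field.separator",
        "stream.reduce.input.field.separator",
        "stream.reduce.output.field.separator",
        "mapred.textoutputformat.separator");

    /**
     * Returns a copy of conf in which every separator property that was not
     * set explicitly takes the value of the global option, if one is present.
     */
    static Map<String, String> expand(Map<String, String> conf) {
        Map<String, String> out = new LinkedHashMap<>(conf);
        String global = conf.get(GLOBAL);
        if (global != null) {
            for (String property : SEPARATOR_PROPERTIES) {
                out.putIfAbsent(property, global); // explicit settings win
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(GLOBAL, "|");
        conf.put("mapred.textoutputformat.separator", ",");
        expand(conf).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Letting explicit properties take precedence keeps existing jobs that already
set, say, stream.map.output.field.separator working unchanged.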