[ https://issues.apache.org/jira/browse/HADOOP-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608596#action_12608596 ]

Hadoop QA commented on HADOOP-3341:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12384785/3341-5.patch
against trunk revision 671563.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 11 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2752/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2752/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2752/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2752/console

This message is automatically generated.

> make key-value separators in hadoop streaming fully configurable
> ----------------------------------------------------------------
>
>                 Key: HADOOP-3341
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3341
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>         Attachments: 3341-1.patch, 3341-2.patch, 3341-3.patch, 3341-4.patch, 3341-5.patch
>
>
> By default, hadoop streaming uses TAB as the separator in all places.
> However, in some environments, users may want to use customized separators
> (e.g., ^A = \u0001).
> The separator logic in hadoop streaming is very convoluted.
> Here is a brief summary:
> InputFormat {
>   KeyValueLineRecordReader.java:59:
>   S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
> }
> Mapper {
>   PipeMapper.java:88:
>   S2: clientOut_.write('\t');
>   Call mapper process
>   PipeMapRed.java:124:
>   S3: String mapOutputFieldSeparator = job_.get("stream.map.output.field.separator", "\t");
>   PipeMapRed.java:128:
>   this.numOfMapOutputKeyFields = job_.getInt("stream.num.map.output.key.fields", 1);
> }
> Reducer {
>   PipeReducer.java:78:
>   S4: clientOut_.write('\t');
>   Call reducer process
>   PipeMapRed.java:125:
>   S5: String reduceOutputFieldSeparator = job_.get("stream.reduce.output.field.separator", "\t");
>   PipeMapRed.java:129:
>   this.numOfReduceOutputKeyFields = job_.getInt("stream.num.reduce.output.key.fields", 1);
> }
> OutputFormat {
>   TextOutputFormat.java:112:
>   S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator", "\t");
> }
> Short-cuts:
> 1. In case we use "TextInputFormat", S1 and S2 are not used at all. Lines are
> fed directly into the mapper (through the value part of the key-value pair;
> the keys, which are offsets, are simply ignored).
> 2. For jobs with no reducers, the "Reducer" step is skipped.
> S2 and S4 are the only separators that are hard-coded. We need to make them
> configurable, possibly under the following names for conformity:
>   stream.map.input.field.separator
>   stream.reduce.input.field.separator
> Then, by specifying:
>   -jobconf key.value.separator.in.input.line=^A
>   -jobconf stream.map.input.field.separator=^A
>   -jobconf stream.map.output.field.separator=^A
>   -jobconf stream.reduce.input.field.separator=^A
>   -jobconf stream.reduce.output.field.separator=^A
>   -jobconf mapred.textoutputformat.separator=^A
> we will be able to use ^A instead of TAB in every place!
> Maybe hadoop streaming can also provide a single option to override these 6
> options.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
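
For illustration only, the "single option" idea from the issue description could be approximated on the client side by expanding one separator into all six -jobconf settings. The sketch below is plain Java with no Hadoop dependency. The class name SeparatorJobConf and the method jobConfArgs are hypothetical, and the two "input" property names are the ones proposed in the issue, so they are assumed to take effect only on a build that includes the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: expands one separator into the
// six -jobconf settings listed in the issue description.
class SeparatorJobConf {
    // Property names as they appear in the issue (S1..S6). The S2 and S4
    // names are proposals from the issue, not confirmed Hadoop properties.
    static final String[] SEPARATOR_KEYS = {
        "key.value.separator.in.input.line",     // S1
        "stream.map.input.field.separator",      // S2 (proposed)
        "stream.map.output.field.separator",     // S3
        "stream.reduce.input.field.separator",   // S4 (proposed)
        "stream.reduce.output.field.separator",  // S5
        "mapred.textoutputformat.separator"      // S6
    };

    // Returns the arguments as alternating "-jobconf" and "key=value" entries.
    static List<String> jobConfArgs(String separator) {
        List<String> result = new ArrayList<>();
        for (String key : SEPARATOR_KEYS) {
            result.add("-jobconf");
            result.add(key + "=" + separator);
        }
        return result;
    }

    public static void main(String[] args) {
        // Ctrl-A (code point 1) is the example separator used in the issue.
        String ctrlA = String.valueOf((char) 1);
        System.out.println(String.join(" ", jobConfArgs(ctrlA)));
    }
}
```

The resulting list would be appended to a streaming command line; how a control character such as ^A survives shell quoting is outside the scope of this sketch.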