[ https://issues.apache.org/jira/browse/HADOOP-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598151#action_12598151 ]
Hadoop QA commented on HADOOP-3341:
-----------------------------------
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12382335/3341-4.patch
against trunk revision 658035.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 11 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac
compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of
release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2502/testReport/
Findbugs warnings:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2502/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2502/artifact/trunk/build/test/checkstyle-errors.html
Console output:
http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2502/console
This message is automatically generated.
> make key-value separators in hadoop streaming fully configurable
> ----------------------------------------------------------------
>
> Key: HADOOP-3341
> URL: https://issues.apache.org/jira/browse/HADOOP-3341
> Project: Hadoop Core
> Issue Type: Improvement
> Components: contrib/streaming
> Reporter: Zheng Shao
> Assignee: Zheng Shao
> Attachments: 3341-1.patch, 3341-2.patch, 3341-3.patch, 3341-4.patch
>
>
> By default, hadoop streaming uses TAB as the separator in all places.
> However, in some environments users may want to use custom separators
> (e.g., ^A = \u0001).
> The separator logic in hadoop streaming is very convoluted. Here is a brief
> summary:
> InputFormat {
>   KeyValueLineRecordReader.java:59:
>     S1: String sepStr = job.get("key.value.separator.in.input.line", "\t");
> }
> Mapper {
>   PipeMapper.java:88:
>     S2: clientOut_.write('\t');
>   Call mapper process
>   PipeMapRed.java:124:
>     S3: String mapOutputFieldSeparator =
>           job_.get("stream.map.output.field.separator", "\t");
>   PipeMapRed.java:128:
>     this.numOfMapOutputKeyFields =
>           job_.getInt("stream.num.map.output.key.fields", 1);
> }
> Reducer {
>   PipeReducer.java:78:
>     S4: clientOut_.write('\t');
>   Call reducer process
>   PipeMapRed.java:125:
>     S5: String reduceOutputFieldSeparator =
>           job_.get("stream.reduce.output.field.separator", "\t");
>   PipeMapRed.java:129:
>     this.numOfReduceOutputKeyFields =
>           job_.getInt("stream.num.reduce.output.key.fields", 1);
> }
> OutputFormat {
>   TextOutputFormat.java:112:
>     S6: String keyValueSeparator = job.get("mapred.textoutputformat.separator",
>           "\t");
> }
> Short-cuts:
> 1. When "TextInputFormat" is used, S1 and S2 are not used at all. Lines are
> fed directly into the mapper (through the value part of the key-value pair;
> the keys, which are offsets, are simply ignored).
> 2. For jobs with no reducers, the "Reducer" step is skipped.
> We need to make S2 and S4 (the two hard-coded '\t' writes) configurable,
> possibly under the following names for conformity:
> stream.map.input.field.separator
> stream.reduce.input.field.separator
> Then, by specifying:
>   -jobconf key.value.separator.in.input.line=^A
>   -jobconf stream.map.input.field.separator=^A
>   -jobconf stream.map.output.field.separator=^A
>   -jobconf stream.reduce.input.field.separator=^A
>   -jobconf stream.reduce.output.field.separator=^A
>   -jobconf mapred.textoutputformat.separator=^A
> we will be able to use ^A instead of TAB in every place!
> Maybe hadoop streaming can also provide a single option to override these 6
> options.
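
For illustration only, here is a minimal Java sketch of how a configurable separator and key-field count (as in S3 and S5 above) could split an output line into key and value. It is not the code from the attached patches: the class SeparatorSketch, its get and splitKeyValue helpers, and the plain Map standing in for the job configuration are all hypothetical. It also assumes that a line with fewer separators than requested becomes the key in full, with an empty value.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; a plain Map stands in for the job configuration.
class SeparatorSketch {

    // Mimics job.get(name, defaultValue): returns the default when the key is unset.
    static String get(Map<String, String> conf, String name, String defaultValue) {
        String value = conf.get(name);
        return value != null ? value : defaultValue;
    }

    // Splits line into {key, value} at the numKeyFields-th occurrence of separator.
    // Assumption: with fewer separators than that, the whole line is the key
    // and the value is empty.
    static String[] splitKeyValue(String line, String separator, int numKeyFields) {
        if (numKeyFields < 1 || separator.isEmpty()) {
            throw new IllegalArgumentException("need numKeyFields >= 1 and a non-empty separator");
        }
        int pos = -1;
        int from = 0;
        for (int i = 0; i < numKeyFields; i++) {
            pos = line.indexOf(separator, from);
            if (pos < 0) {
                return new String[] { line, "" };
            }
            from = pos + separator.length();
        }
        return new String[] { line.substring(0, pos), line.substring(from) };
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("stream.map.output.field.separator", "\u0001");
        conf.put("stream.num.map.output.key.fields", "2");

        // Both settings fall back to the defaults quoted in the summary (TAB and 1).
        String sep = get(conf, "stream.map.output.field.separator", "\t");
        int numKeyFields = Integer.parseInt(get(conf, "stream.num.map.output.key.fields", "1"));

        String[] kv = splitKeyValue("a\u0001b\u0001c\u0001d", sep, numKeyFields);
        // Show the control character as '|' so the result is readable.
        System.out.println(kv[0].replace('\u0001', '|') + " -> " + kv[1].replace('\u0001', '|'));
        // prints "a|b -> c|d"
    }
}
```

With no entries in the map, the same calls fall back to TAB and a single key field, which matches the defaults listed in the summary.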