[ https://issues.apache.org/jira/browse/HADOOP-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj Das updated HADOOP-2806:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Amareshwari!

> Streaming has no way to force entire record (or null) as key
> ------------------------------------------------------------
>
>                 Key: HADOOP-2806
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2806
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/streaming
>            Reporter: Marco Nicosia
>            Assignee: Amareshwari Sriramadasu
>            Priority: Minor
>             Fix For: 0.17.0
>
>         Attachments: patch-2806.txt
>
>
> I think perhaps streaming needs a "-allkey" or "-nullkey" option? Otherwise,
> I'm concerned there is a subtle streaming documentation problem.
> These two docs:
> http://hadoop.apache.org/core/docs/current/streaming.html
> http://wiki.apache.org/hadoop/HadoopStreaming (Should be merged with above?)
> ... seem to ignore that streaming, by default, splits key/value on TAB. Sure,
> they mention it, but in all the simple (no separator) examples, they don't
> seem to take into account that streaming may inconsistently decide whether
> the whole line is the key, or just up to the first tab, should one occur.
> This means that some records might be sorted differently as compared to
> others based on whether or not there's a tab?
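The default split described above can be sketched in a few lines of Python. This is only an illustration, not the Hadoop Streaming code: the function names `split_record` and `sort_by_key` and the sample records are made up for the example. The only behaviour assumed is what the report states: text up to the first tab is the key, and a line with no tab is treated entirely as the key, with an empty value.

```python
# Illustrative sketch only: mimics the default key/value split described
# in the issue. It is not the Hadoop Streaming implementation.


def split_record(line, separator="\t"):
    """Return (key, value), splitting at the first separator.

    With no separator in the line, the whole line becomes the key and
    the value is empty.
    """
    key, _, value = line.partition(separator)
    return key, value


def sort_by_key(lines):
    """Split every line, then order the resulting pairs by key only.

    Python's sort is stable, so values under the same key keep their
    input order here; a real job makes no such promise.
    """
    return sorted((split_record(line) for line in lines), key=lambda kv: kv[0])


# Hypothetical sample records: the same data with and without tabs.
with_tabs = ["a\t1", "b\t3", "a\t4"]
without_tabs = ["a 1", "b 3", "a 4"]

print(sort_by_key(with_tabs))     # keys are "a", "a", "b"
print(sort_by_key(without_tabs))  # keys are the whole lines
```

With tabs, both "a" records share one key, so their relative order depends on the framework rather than on the data. Without tabs, the digit is part of the key and takes part in the sort.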
> Here's a very simple pair of examples, that to the naive, should produce the
> same output, but do not:
> > [hod] (marco) >> run dfs -fs local -cat str-tabs
> > a	1
> > b	3
> > a	4
> >
> > [hod] (marco) >> run dfs -put str-tabs str-tabs
> >
> > [hod] (marco) >> run jar hadoop-streaming.jar -input str-tabs -output str-tabs.out -mapper /bin/cat -reducer /bin/cat
> > [blah blah blah]
> >
> > [hod] (marco) >> run dfs -cat str-tabs.out/part-00000
> > a	4
> > a	1
> > b	3
> Compare to this negative-test:
> > [hod] (marco) >> run dfs -fs local -cat str-notabs
> > a 1
> > b 3
> > a 4
> >
> > [hod] (marco) >> run dfs -put str-notabs str-notabs
> >
> > [hod] (marco) >> run jar hadoop-streaming.jar -input str-notabs -output str-notabs.out -mapper /bin/cat -reducer /bin/cat
> > [blah blah blah]
> >
> > [hod] (marco) >> run dfs -cat str-notabs.out/part-00000
> > a 1
> > a 4
> > b 3
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.