Streaming has no way to force entire record (or null) as key
------------------------------------------------------------

                 Key: HADOOP-2806
                 URL: https://issues.apache.org/jira/browse/HADOOP-2806
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/streaming
            Reporter: Marco Nicosia
            Priority: Minor
             Fix For: 0.17.0


I think streaming may need an "-allkey" or "-nullkey" option. Failing that, 
I'm concerned there is at least a subtle problem in the streaming documentation.

These two docs:

http://hadoop.apache.org/core/docs/current/streaming.html
http://wiki.apache.org/hadoop/HadoopStreaming (Should be merged with above?)

... seem to gloss over the fact that streaming, by default, splits each record 
into key and value at the first TAB. Sure, they mention it, but none of the 
simple (no separator) examples account for the consequence: the key is the 
whole line when the line contains no tab, and only the text up to the first 
tab when one occurs. This means that records may be sorted differently 
depending on whether or not they contain a tab.
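
To make the default behaviour concrete, here is a minimal Python sketch of the
split described above. It is an illustration only, not streaming's actual code:
the function name split_key_value is hypothetical, and representing the missing
value as None is an assumption made for this example.

```python
def split_key_value(line, separator="\t"):
    """Split one record at the first separator, as described in this issue.

    Hypothetical helper for illustration; not part of Hadoop streaming.
    """
    key, found, value = line.partition(separator)
    if not found:
        # No separator in the line: the whole line becomes the key.
        # Using None for the value is an assumption of this sketch.
        return line, None
    return key, value


print(split_key_value("a\t1"))  # key is "a", value is "1"
print(split_key_value("a 1"))   # key is the whole line "a 1"
```

With this rule, "a<TAB>1" and "a<TAB>4" share the key "a", while "a 1" and
"a 4" are two distinct keys.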

Here's a very simple pair of examples that, to the naive user, should produce 
the same output, but do not:

> [hod] (marco) >> run dfs -fs local -cat str-tabs
> a       1
> b       3
> a       4
> 
> [hod] (marco) >> run dfs -put str-tabs str-tabs
> 
> [hod] (marco) >> run jar hadoop-streaming.jar -input str-tabs -output 
> str-tabs.out -mapper /bin/cat -reducer /bin/cat     
> [blah blah blah]
> 
> [hod] (marco) >> run dfs -cat str-tabs.out/part-00000
> a       4
> a       1
> b       3

Compare with this negative test (same data, spaces instead of tabs):
> [hod] (marco) >> run dfs -fs local -cat str-notabs
> a 1
> b 3
> a 4
> 
> [hod] (marco) >> run dfs -put str-notabs str-notabs
> 
> [hod] (marco) >> run jar hadoop-streaming.jar -input str-notabs -output 
> str-notabs.out -mapper /bin/cat -reducer /bin/cat
> [blah blah blah]
> 
> [hod] (marco) >> run dfs -cat str-notabs.out/part-00000
> a 1
> a 4
> b 3
> 
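
One way to read the two outputs: if the sort compares keys only, then any order
of the values within a key is acceptable. The sketch below checks the outputs
from both runs under that assumption. The helper names sort_key and
is_key_sorted are hypothetical, and plain string comparison stands in for
whatever comparison the framework really uses.

```python
def sort_key(line):
    # Hypothetical helper: text before the first TAB, or the whole line
    # when the line contains no TAB.
    return line.split("\t", 1)[0]


def is_key_sorted(lines):
    # True when the keys alone are in non-decreasing order;
    # the values are deliberately ignored.
    keys = [sort_key(line) for line in lines]
    return all(a <= b for a, b in zip(keys, keys[1:]))


tabs_observed = ["a\t4", "a\t1", "b\t3"]     # part-00000 from the str-tabs run
tabs_alternative = ["a\t1", "a\t4", "b\t3"]  # what a naive user would expect
notabs_observed = ["a 1", "a 4", "b 3"]      # part-00000 from the str-notabs run
notabs_swapped = ["a 4", "a 1", "b 3"]       # same lines, first two swapped

print(is_key_sorted(tabs_observed))     # True
print(is_key_sorted(tabs_alternative))  # True
print(is_key_sorted(notabs_observed))   # True
print(is_key_sorted(notabs_swapped))    # False
```

Under this assumption both orderings of the tabbed records are valid sorted
output, whereas the tab-less records have exactly one valid order, because
each whole line is its own key.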

