[jira] Updated: (HADOOP-2806) Streaming has no way to force entire record (or null) as key

Devaraj Das (JIRA) Sun, 16 Mar 2008 22:49:13 -0700

     [ https://issues.apache.org/jira/browse/HADOOP-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Devaraj Das updated HADOOP-2806:
--------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Amareshwari!

> Streaming has no way to force entire record (or null) as key
> ------------------------------------------------------------
>
>                 Key: HADOOP-2806
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2806
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/streaming
>            Reporter: Marco Nicosia
>            Assignee: Amareshwari Sriramadasu
>            Priority: Minor
>             Fix For: 0.17.0
>
>         Attachments: patch-2806.txt
>
>
> I think perhaps streaming needs a "-allkey" or "-nullkey" option? Otherwise, 
> I'm concerned there is a subtle streaming documentation problem.
> These two docs:
> http://hadoop.apache.org/core/docs/current/streaming.html
> http://wiki.apache.org/hadoop/HadoopStreaming (Should be merged with above?)
> ... seem to gloss over the fact that streaming, by default, splits key/value 
> on TAB. Sure, they mention it, but none of the simple (no separator) examples 
> take into account that streaming treats the whole line as the key when it 
> contains no tab, and only the text up to the first tab when it does. 
> This means that some records might be sorted differently from others 
> depending on whether or not they contain a tab.
> Here's a very simple pair of examples that, to the naive, should produce the 
> same output, but do not:
> > [hod] (marco) >> run dfs -fs local -cat str-tabs
> > a       1
> > b       3
> > a       4
> > 
> > [hod] (marco) >> run dfs -put str-tabs str-tabs
> > 
> > [hod] (marco) >> run jar hadoop-streaming.jar -input str-tabs -output 
> > str-tabs.out -mapper /bin/cat -reducer /bin/cat     
> > [blah blah blah]
> > 
> > [hod] (marco) >> run dfs -cat str-tabs.out/part-00000
> > a       4
> > a       1
> > b       3
> Compare to this negative test:
> > [hod] (marco) >> run dfs -fs local -cat str-notabs
> > a 1
> > b 3
> > a 4
> > 
> > [hod] (marco) >> run dfs -put str-notabs str-notabs
> > 
> > [hod] (marco) >> run jar hadoop-streaming.jar -input str-notabs -output 
> > str-notabs.out -mapper /bin/cat -reducer /bin/cat
> > [blah blah blah]
> > 
> > [hod] (marco) >> run dfs -cat str-notabs.out/part-00000
> > a 1
> > a 4
> > b 3
> > 
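
The difference between the two runs above can be modeled with a short sketch. This is a simplified illustration of the behavior the report describes (split at the first TAB, otherwise the whole line is the key), not Hadoop's actual implementation. The function names `split_record` and `sort_keys` are hypothetical, and the empty value for a line with no tab is an assumption.

```python
def split_record(line, separator="\t"):
    """Split one line into (key, value) at the first separator.

    Assumption: if the line has no separator, the whole line becomes
    the key and the value is empty.
    """
    key, _, value = line.partition(separator)
    return key, value


def sort_keys(lines):
    """Return only the keys, sorted, as a stand-in for the shuffle/sort step."""
    return sorted(split_record(line)[0] for line in lines)


with_tabs = ["a\t1", "b\t3", "a\t4"]
without_tabs = ["a 1", "b 3", "a 4"]

# With tabs, the two "a" records share one key, so the sort says nothing
# about the relative order of their values.
tab_keys = sort_keys(with_tabs)

# Without tabs, every line is its own key, so the lines are fully ordered.
space_keys = sort_keys(without_tabs)

print(tab_keys)
print(space_keys)
```

In this model the tab-separated input yields two records with the identical key `a`, which is consistent with `a       4` appearing before `a       1` in the first transcript, while the space-separated input yields three distinct keys and a fully sorted result.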

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
