Re: [jira] Created: (HADOOP-2806) Streaming has no way to force entire record (or null) as key

arkady borkovsky Mon, 11 Feb 2008 09:38:32 -0800

There are two work-arounds for this:
(a) specify a different field separator
    -jobconf stream.map.output.field.separator=.
   I hope it takes any character, including \0


(b) specify that your "records" have a lot of fields
    -jobconf stream.num.map.output.key.fields=999
   (I hope this works...)
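
To make the effect of the two options concrete, here is a minimal sketch in plain Python of how a line could be split into key and value. It is only a simplified model, not Streaming's actual code: `split_record` and its parameters are hypothetical names, and the assumed rule is that the key runs up to the N-th separator, with the whole line becoming the key when there are fewer separators than that.

```python
def split_record(line, separator="\t", num_key_fields=1):
    """Split one mapper output line into (key, value).

    Assumed, simplified rule: the key is everything before the
    num_key_fields-th separator. If the line contains fewer separators
    than that, the whole line is the key and the value is empty.
    """
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


print(split_record("a\t1"))                      # default TAB split: key is "a"
print(split_record("a\t1", separator="."))       # like (a): no "." in the line, so the whole line is the key
print(split_record("a\t1", num_key_fields=999))  # like (b): far fewer than 999 fields, so the whole line is the key
```

Under this model, (a) only helps if the chosen separator never occurs in the data, while (b) works for any record with fewer than 999 fields.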

Although both of these are "work-arounds", they do not look any worse than the general way we specify Streaming options.


Hopefully, this is going to be better once Streaming is in Pig.

--ab



On Feb 9, 2008, at 5:04 PM, Marco Nicosia (JIRA) wrote:

Streaming has no way to force entire record (or null) as key
------------------------------------------------------------

                 Key: HADOOP-2806
                 URL: https://issues.apache.org/jira/browse/HADOOP-2806
             Project: Hadoop Core
          Issue Type: Bug
          Components: contrib/streaming
            Reporter: Marco Nicosia
            Priority: Minor
             Fix For: 0.17.0

I think perhaps streaming needs a "-allkey" or "-nullkey" option? Otherwise, I'm concerned there is a subtle streaming documentation problem.

These two docs:

http://hadoop.apache.org/core/docs/current/streaming.html
http://wiki.apache.org/hadoop/HadoopStreaming (Should be merged with above?)

... seem to ignore that streaming, by default, splits key/value on TAB. Sure, they mention it, but in all the simple (no separator) examples, they don't seem to take into account that streaming may inconsistently decide whether the whole line is the key, or just up to the first tab, should one occur. This means that some records might be sorted differently as compared to others based on whether or not there's a tab?

Here's a very simple pair of examples that, to the naive, should produce the same output, but do not:

[hod] (marco) >> run dfs -fs local -cat str-tabs
a       1
b       3
a       4

[hod] (marco) >> run dfs -put str-tabs str-tabs
[hod] (marco) >> run jar hadoop-streaming.jar -input str-tabs -output str-tabs.out -mapper /bin/cat -reducer /bin/cat
[blah blah blah]

[hod] (marco) >> run dfs -cat str-tabs.out/part-00000
a       4
a       1
b       3
Compare to this negative-test:
[hod] (marco) >> run dfs -fs local -cat str-notabs
a 1
b 3
a 4

[hod] (marco) >> run dfs -put str-notabs str-notabs
[hod] (marco) >> run jar hadoop-streaming.jar -input str-notabs -output str-notabs.out -mapper /bin/cat -reducer /bin/cat
[blah blah blah]

[hod] (marco) >> run dfs -cat str-notabs.out/part-00000
a 1
a 4
b 3
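
The difference between the two runs can be sketched in plain Python. This is only an illustration of the default rule described above (key is the text before the first TAB, or the whole line if there is no TAB); `group_by_key` is a hypothetical helper, not part of Hadoop.

```python
from collections import defaultdict


def group_by_key(lines):
    """Group values under the key found before the first TAB.

    A line with no TAB becomes a key with an empty value.
    Keys are returned in sorted order, as a reducer would see them.
    """
    groups = defaultdict(list)
    for line in lines:
        key, _, value = line.partition("\t")
        groups[key].append(value)
    return dict(sorted(groups.items()))


with_tabs = ["a\t1", "b\t3", "a\t4"]    # contents of str-tabs
without_tabs = ["a 1", "b 3", "a 4"]    # contents of str-notabs

print(group_by_key(with_tabs))     # two keys; "1" and "4" both sit under key "a"
print(group_by_key(without_tabs))  # three distinct keys, each a whole line
```

With tabs, only the key "a" takes part in the sort, so the relative order of its two values is presumably not guaranteed, which would explain "a 4" appearing before "a 1". Without tabs, every whole line is its own key, so the output order is fully determined by the sort.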
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
