In streaming, map-output cannot have empty keys
-----------------------------------------------
Key: HADOOP-2954
URL: https://issues.apache.org/jira/browse/HADOOP-2954
Project: Hadoop Core
Issue Type: Bug
Components: contrib/streaming
Affects Versions: 0.16.0
Environment: All
Reporter: Milind Bhandarkar
Assignee: Sameer Paranjpye
Fix For: 0.17.0
Here is the analysis for the case where both the mapper and the reducer are
/bin/cat and the key field separator is the default, '\t' (tab).
For example, if the input line is:
\tSDSDFIKSDFSDFJS
the input for the mapper ('cat' in this case) is:
\tSDSDFIKSDFSDFJS
-
the output of the mapper is split into a key, value pair as below:
(key, value) -> (\tSDSDFIKSDFSDFJS, "")
(i.e. the value is empty)
the function that splits the output into a (key, value) pair for
streaming jobs ignores the first character of the line, so a separator
in the first position is never recognized as the end of the key
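
The behavior described above can be modeled with a short sketch. This is not the streaming source code; the function name split_ignoring_first_char is hypothetical, and the only assumption is the one stated in this report: the separator search starts after the first character.

```python
def split_ignoring_first_char(line, sep="\t"):
    """Hypothetical model of the reported behavior (not Hadoop's code).

    The separator search starts at index 1, so a separator at
    position 0 is never seen.
    """
    pos = line.find(sep, 1)  # the search skips the first character
    if pos == -1:
        # no separator found: the whole line is the key, the value is empty
        return line, ""
    return line[:pos], line[pos + len(sep):]


# The sample line from this report: the leading tab stays inside the key.
key, value = split_ignoring_first_char("\tSDSDFIKSDFSDFJS")
print(repr(key), repr(value))
```

Under this model a line that begins with the separator can never produce an empty key, which matches the (key, value) pair shown above.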
-
from the above (key, value) pair, the input for the reducer is:
(key followed by separator followed by value)
\tSDSDFIKSDFSDFJS\t
if the reducer is set to NONE, the above line is the output of
the map task
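
As a rough illustration of this step, the reducer input can be rebuilt by joining key, separator, and value. The helper name reducer_input_line is hypothetical and only mirrors the "key followed by separator followed by value" rule quoted above.

```python
def reducer_input_line(key, value, sep="\t"):
    """Rebuild a line as key + separator + value (hypothetical helper)."""
    return key + sep + value


original = "\tSDSDFIKSDFSDFJS"
# The pair reported in this issue: the whole line as key, an empty value.
rebuilt = reducer_input_line("\tSDSDFIKSDFSDFJS", "")
print(repr(original), repr(rebuilt))
```

Because the value is empty, the rebuilt line gains a trailing separator and no longer equals the original input line.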
-
the output of the reducer ('cat' in this case) is:
\tSDSDFIKSDFSDFJS\t
-
if the line starts with the field separator, mapper output lines that
should share the same (empty) key can be assigned to different reducers,
because a line may contain more than one instance of the field
separator. For example:
input-line=\tABCDEFGH
key=\tABCDEFGH
value=
(value is empty)
output-line=\tABCDEFGH\t
input-line=\tABCDEFGHYH\tJHUHJH
key=\tABCDEFGHYH
value=JHUHJH
output-line=\tABCDEFGHYH\tJHUHJH
assuming defaults (HashPartitioner), they are likely to be assigned to
different reducers because the keys are different.
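
A minimal sketch of why different keys tend to land on different reducers. The partition step has the same shape as HashPartitioner (non-negative hash modulo the number of reducers), but stand_in_hash is an assumed, simplified hash used only for illustration; it is not claimed to match the hash Hadoop computes for streaming keys.

```python
def stand_in_hash(key):
    """Assumed 32-bit multiplicative hash; a stand-in, not Hadoop's hash."""
    h = 0
    for b in key.encode("utf-8"):
        h = (31 * h + b) & 0xFFFFFFFF
    return h


def partition(key, num_reducers):
    """Non-negative hash modulo the reducer count."""
    return (stand_in_hash(key) & 0x7FFFFFFF) % num_reducers


# The two keys produced in the example above, with two reducers.
for k in ("\tABCDEFGH", "\tABCDEFGHYH"):
    print(repr(k), "->", partition(k, 2))
```

With this stand-in hash and two reducers, the two keys above fall into different partitions. If both lines had an empty key, they would always fall into the same one.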
The streaming contract says that everything from the beginning of the line up
to the first tab is the key, so the key should be the empty string. But it is not.
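
For comparison, a split that follows the contract as stated here might look like the sketch below. The name split_by_contract is hypothetical, and this is not the fix applied in Hadoop; it only shows the expected result for lines that start with a tab.

```python
def split_by_contract(line, sep="\t"):
    """Key is everything before the first separator, value is the rest.

    If the line has no separator, the whole line is the key and the
    value is empty.
    """
    key, _, value = line.partition(sep)
    return key, value


# Both sample lines from this report now share the empty key.
for sample in ("\tABCDEFGH", "\tABCDEFGHYH\tJHUHJH"):
    print(split_by_contract(sample))
```

With this split, every line that begins with the separator has the same empty key, so the default partitioner would send all such lines to one reducer.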