Antonio Piccolboni created MAPREDUCE-5048:
---------------------------------------------
Summary: streaming combiner feature breaks when input binary,
output text
Key: MAPREDUCE-5048
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5048
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: contrib/streaming
Affects Versions: 1.0.2
Environment: centos 6.2
Reporter: Antonio Piccolboni
When running a Hadoop streaming job with binary (typedbytes) input and shuffle
but text output, and with a combiner enabled, the job fails with the error:
java.lang.RuntimeException: java.io.IOException: wrong key class: class org.apache.hadoop.io.Text is not class org.apache.hadoop.typedbytes.TypedBytesWritable
Repro:
hadoop jar <streaming jar> \
  -D 'stream.map.input=typedbytes' \
  -D 'stream.map.output=typedbytes' \
  -D 'stream.reduce.input=typedbytes' \
  -input <sequence file containing typedbytes> \
  -output <any valid dir> \
  -mapper cat -combiner cat -reducer cat \
  -inputformat 'org.apache.hadoop.streaming.AutoInputFormat'
If you remove the -combiner option, the job works, with only performance
implications. If you keep the combiner and additionally specify -D
'stream.reduce.output=typedbytes', the job succeeds but outputs raw typedbytes
(without the sequence file superstructure).
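For comparison, the two variants just described can be written out as a small dry-run script. This is only a sketch: it prints the command lines rather than running them, and STREAMING_JAR, INPUT and OUTPUT are hypothetical placeholders standing in for the angle-bracket values in the repro, not paths from this report. The options themselves are the ones given in the repro command.

```shell
#!/bin/sh
# Dry-run sketch: prints the two variants instead of launching Hadoop.
# STREAMING_JAR, INPUT and OUTPUT are hypothetical placeholders; override
# them in the environment before adapting this to a real cluster.
STREAMING_JAR="${STREAMING_JAR:-/path/to/hadoop-streaming.jar}"
INPUT="${INPUT:-/path/to/typedbytes-seqfile}"
OUTPUT="${OUTPUT:-/path/to/output-dir}"
INPUT_FORMAT="org.apache.hadoop.streaming.AutoInputFormat"

# Generic -D options shared by both variants, taken from the repro command.
common_opts() {
  printf '%s ' \
    "-D stream.map.input=typedbytes" \
    "-D stream.map.output=typedbytes" \
    "-D stream.reduce.input=typedbytes"
}

# Variant 1: drop -combiner. Text output as intended, at a performance cost.
cmd_without_combiner() {
  printf 'hadoop jar %s %s-input %s -output %s -mapper cat -reducer cat -inputformat %s\n' \
    "$STREAMING_JAR" "$(common_opts)" "$INPUT" "$OUTPUT" "$INPUT_FORMAT"
}

# Variant 2: keep -combiner and add stream.reduce.output=typedbytes.
# The job succeeds, but the output is raw typedbytes rather than text.
cmd_with_typedbytes_output() {
  printf 'hadoop jar %s %s-D stream.reduce.output=typedbytes -input %s -output %s -mapper cat -combiner cat -reducer cat -inputformat %s\n' \
    "$STREAMING_JAR" "$(common_opts)" "$INPUT" "$OUTPUT" "$INPUT_FORMAT"
}

cmd_without_combiner
cmd_with_typedbytes_output
```

Neither variant gives the behavior being asked for (combiner on, text output), which is why this is filed as a bug rather than a usage question.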
I asked in the discussion of HADOOP-1722 (where typedbytes was first
introduced) whether this is a bug or a misunderstanding of that spec on my
part, and a committer chimed in to say it looks like a bug to him too.
Originally reported by a user of the rmr2 package for R and filed by me here:
https://github.com/RevolutionAnalytics/rmr2/issues/16