[
https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404201#comment-13404201
]
Antonio Piccolboni commented on HADOOP-1722:
--------------------------------------------
I see we don't have any properties for streaming combiners, and I am having
problems with the reducer being fed the wrong format by the combiner. Before I
submit a bug, I would like to understand what the intended behavior is. Without
combiner-specific properties, such as stream.combine.input and
stream.combine.output, it seems combiners read the properties that apply to the
reducer. When that happens, it is impossible to have the reducer read
typedbytes and write text when the combiner is on, since the reducer expects
typedbytes as input and the combiner will provide text. The workaround is to
run the job with a single serialization format and then add a conversion
job that doesn't use combiners, which not only adds a job but is also
surprising to users and a violation of orthogonality. This is in the context of
the development of RHadoop/rmr, a MapReduce package for R.
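
To make the suspected mismatch concrete, here is a toy model in Python. It is
only a sketch of the behavior described above, not Hadoop code: "toybytes" is a
simplified length-prefixed stand-in for typedbytes, and write_record,
read_record and run_stage are hypothetical helpers. It assumes, as the comment
suspects, that the combiner reuses stream.reduce.input and
stream.reduce.output.

```python
import struct

def write_record(fmt, key, value):
    """Serialize one key/value pair in the named format."""
    if fmt == "text":
        return f"{key}\t{value}\n".encode("utf-8")
    if fmt == "toybytes":  # simplified stand-in for typedbytes (assumption)
        out = b""
        for field in (key, value):
            raw = field.encode("utf-8")
            out += struct.pack(">I", len(raw)) + raw
        return out
    raise ValueError(f"unknown format: {fmt}")

def read_record(fmt, data):
    """Parse one key/value pair, raising ValueError on malformed input."""
    if fmt == "text":
        key, value = data.decode("utf-8").rstrip("\n").split("\t", 1)
        return key, value
    if fmt == "toybytes":
        fields, pos = [], 0
        for _ in range(2):
            if pos + 4 > len(data):
                raise ValueError("truncated length prefix")
            (size,) = struct.unpack_from(">I", data, pos)
            pos += 4
            if pos + size > len(data):
                raise ValueError("length prefix exceeds available data")
            fields.append(data[pos:pos + size].decode("utf-8"))
            pos += size
        return tuple(fields)
    raise ValueError(f"unknown format: {fmt}")

# Only reducer-side properties exist; the combiner is assumed to reuse them.
props = {"stream.reduce.input": "toybytes", "stream.reduce.output": "text"}

def run_stage(data):
    """Model a combiner or reducer: read per reduce.input, write per reduce.output."""
    key, value = read_record(props["stream.reduce.input"], data)
    return write_record(props["stream.reduce.output"], key, value)

map_output = write_record("toybytes", "word", "3")
combined = run_stage(map_output)   # combiner: toybytes in, text out
try:
    run_stage(combined)            # reducer: expects toybytes, receives text
    reducer_ok = True
except ValueError:
    reducer_ok = False
```

In this model the combiner succeeds but the reducer fails, because the text
record the combiner emits cannot be parsed as the binary format the reducer is
configured to read.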
> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
> Key: HADOOP-1722
> URL: https://issues.apache.org/jira/browse/HADOOP-1722
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Runping Qi
> Assignee: Klaas Bosteels
> Fix For: 1.0.2, 0.21.0
>
> Attachments: HADOOP-1722-branch-0.18.patch,
> HADOOP-1722-branch-0.19.patch, HADOOP-1722-v0.20.1.patch,
> HADOOP-1722-v2.patch, HADOOP-1722-v3.patch, HADOOP-1722-v4.patch,
> HADOOP-1722-v4.patch, HADOOP-1722-v5.patch, HADOOP-1722-v6.patch,
> HADOOP-1722.patch
>
>
> Right now, the streaming framework expects the output of the streaming process
> (mapper or reducer) to be line-oriented UTF-8 text. This limit makes it
> impossible to use programs whose output may be non-UTF-8 (international
> encodings, or maybe even binary data). Streaming can overcome this limit by
> introducing a simple encoding protocol. For example, it can allow the
> mapper/reducer to hex-encode its keys/values, which the framework then decodes
> on the Java side. This way, as long as the mapper/reducer executables follow
> this encoding protocol, they can output arbitrary byte arrays and the
> streaming framework can handle them.
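
The hex-encoding idea in the description can be sketched roughly as follows.
This is an illustration only, assuming a tab-separated "key<TAB>value" line
layout; encode_record and decode_record are hypothetical names, and
decode_record stands in for the decoding that the description places on the
Java side of the framework.

```python
def encode_record(key: bytes, value: bytes) -> str:
    """Mapper/reducer side: hex-encode key and value into one text line."""
    # Hex output contains only [0-9a-f], so tabs and newlines inside the
    # raw bytes can no longer collide with the field and record separators.
    return key.hex() + "\t" + value.hex()

def decode_record(line: str) -> tuple:
    """Framework side: recover the original byte arrays from a line."""
    hex_key, hex_value = line.rstrip("\n").split("\t", 1)
    return bytes.fromhex(hex_key), bytes.fromhex(hex_value)

# Bytes that are not valid UTF-8 and that contain separator characters.
raw_key, raw_value = b"\xff\t", b"\x00\n"
line = encode_record(raw_key, raw_value)
round_trip = decode_record(line + "\n")
```

The cost of this scheme is that every payload doubles in size on the wire,
which is one reason a binary protocol may be preferable to hex-encoded text.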