[jira] Commented: (HADOOP-1722) Make streaming to handle non-utf8 byte array

Runping Qi (JIRA) Wed, 28 Jan 2009 08:02:22 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668059#action_12668059 ]


Runping Qi commented on HADOOP-1722:
------------------------------------


I think you need flags for the following three cases:

1.  stream.reduce.output.typed.bytes=true
In this case, the PipeMapRed and PipeReducer classes need to interpret the 
reduce output as typed bytes and deserialize it accordingly.
This is the case where the user wants reducers to generate binary data and 
output it in the typed bytes format.

2.  stream.map.output.typed.bytes=true
In this case, the PipeMapRed and PipeMapper classes need to interpret the 
mapper output as typed bytes and deserialize it accordingly.
This is the case where the user wants mappers to generate binary data. Here 
the types of the map output key/value pairs (and of the reducer input 
key/value pairs) are typed bytes. The types for map output must be the same 
as those for the reduce input.

3. stream.map.input.typed.bytes=true
The intended use case for this setting may be that the user knows that the 
input data is in typed bytes and does not want the PipeMapRed (PipeMapper) 
class to convert it into text by calling toString. Rather, the PipeMapRed 
class should serialize it according to the typed bytes format.
The mapper program will then interpret the serialized format properly.
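
To make the third case concrete, here is a minimal sketch in Python of writing a key/value pair in a typed-bytes-like layout instead of converting it to text, and of a mapper program reading it back. The layout used (a one-byte type code, then a 4-byte big-endian length, then the raw payload), the two type codes, and all function names are assumptions made for illustration; the actual format is whatever the attached patch defines.

```python
import io
import struct

# Assumed type codes for this sketch only (hypothetical, not taken from the patch).
TYPE_BYTES = 0
TYPE_STRING = 7


def write_value(out, value):
    """Write one value as: type code (1 byte), length (4 bytes, big-endian), payload."""
    if isinstance(value, bytes):
        code, payload = TYPE_BYTES, value
    elif isinstance(value, str):
        code, payload = TYPE_STRING, value.encode("utf-8")
    else:
        raise TypeError("unsupported type in this sketch: %r" % type(value))
    out.write(struct.pack(">BI", code, len(payload)))
    out.write(payload)


def read_value(inp):
    """Read one value; return None on a clean end of stream."""
    header = inp.read(5)
    if not header:
        return None
    if len(header) < 5:
        raise EOFError("truncated header")
    code, length = struct.unpack(">BI", header)
    payload = inp.read(length)
    if len(payload) != length:
        raise EOFError("truncated payload")
    if code == TYPE_BYTES:
        return payload
    if code == TYPE_STRING:
        return payload.decode("utf-8")
    raise ValueError("unknown type code %d" % code)


def read_pairs(inp):
    """Yield (key, value) pairs, as a mapper program would read them from stdin."""
    while True:
        key = read_value(inp)
        if key is None:
            return
        value = read_value(inp)
        if value is None:
            raise EOFError("key without a value")
        yield key, value


if __name__ == "__main__":
    buf = io.BytesIO()
    write_value(buf, b"\xff\x00\n\t")  # bytes that would not survive toString
    write_value(buf, "some value")
    buf.seek(0)
    for key, value in read_pairs(buf):
        print(repr(key), repr(value))
```

Because the length prefix delimits each value, tabs, newlines, and non-UTF-8 bytes inside a key or value need no escaping, which is the point of skipping the toString conversion.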


> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
>                 Key: HADOOP-1722
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1722
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Runping Qi
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-1722-v2.patch, HADOOP-1722-v3.patch, 
> HADOOP-1722.patch
>
>
> Right now, the streaming framework expects the outputs of the stream process 
> (mapper or reducer) to be line-oriented UTF-8 text. This limit makes it 
> impossible to use programs whose outputs may be non-UTF-8 (international 
> encoding, or maybe even binary data). Streaming can overcome this limit by 
> introducing a simple encoding protocol. For example, it can allow the 
> mapper/reducer to hex-encode its keys/values, and the framework decodes them 
> on the Java side.
> This way, as long as the mapper/reducer executables follow this encoding 
> protocol, they can output arbitrary byte arrays and the streaming framework 
> can handle them.
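
The hex-encoding idea in the issue description above can be sketched as follows. This is only an illustration in Python: the function names are hypothetical, the tab-separated line layout is an assumption, and in streaming itself the decoding half would live on the Java side of the framework.

```python
def encode_record(key: bytes, value: bytes) -> str:
    """Executable side: emit one text line with hex-encoded key and value."""
    return key.hex() + "\t" + value.hex()


def decode_record(line: str):
    """Framework side: recover the original byte arrays from one text line."""
    key_hex, value_hex = line.rstrip("\n").split("\t", 1)
    return bytes.fromhex(key_hex), bytes.fromhex(value_hex)


if __name__ == "__main__":
    line = encode_record(b"\xff\n", b"\x00\t")
    print(line)
    print(decode_record(line))
```

Hex digits never contain a tab or newline, so the encoded line stays valid line-oriented text no matter which bytes the key and value hold. The cost is that every byte doubles in size on the wire.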

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
