[
https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Klaas Bosteels updated HADOOP-1722:
-----------------------------------
Attachment: HADOOP-1722-v2.patch
This second version of my patch addresses the issues raised by Runping:
* The javadoc for the package {{org.apache.hadoop.typedbytes}} now includes a
detailed description of the typed bytes format (a small illustrative sketch follows this list).
* The (boolean-valued) typedbytes-related properties for streaming are now:
** {{stream.map.input.typed.bytes}}
** {{stream.map.output.typed.bytes}}
** {{stream.reduce.input.typed.bytes}}
** {{stream.reduce.output.typed.bytes}}
* The command line option {{-typedbytes}} was changed such that it can take the
values {{none|mapper|reducer|input|output|all}} (that should cover most cases;
otherwise the properties listed above can be set manually).
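
Roughly, a typed bytes record is a one-byte type code followed by the value's serialized form. The minimal sketch below shows how a single string value could be written that way; the concrete type code (7) and the four-byte big-endian length prefix are assumptions taken from the format description and should be checked against the package javadoc.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal sketch of writing one typed bytes record, assuming the layout of a
// 1-byte type code, then a 4-byte big-endian length, then the raw bytes.
public class TypedBytesSketch {
  private static final int STRING_TYPE_CODE = 7; // assumption, check the javadoc

  public static byte[] writeString(String value) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    byte[] utf8 = value.getBytes("UTF-8");
    out.writeByte(STRING_TYPE_CODE); // type code
    out.writeInt(utf8.length);       // big-endian length
    out.write(utf8);                 // payload bytes
    out.flush();
    return buf.toByteArray();
  }
}
{code}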
BTW: The comment "The reduce input type flag should always be the same as the
map output flag" does not really hold, since {{TypedBytesWritable}}'s
{{toString()}} outputs sensible strings, so it is perfectly possible to output
typed bytes in the mapper and let streaming convert them to strings and pass
those strings as input to the reducer.
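
To make the "set manually" case concrete, a hypothetical job setup matching the scenario above could look like this; the property names are the ones listed earlier, but how streaming interprets them is an assumption of this sketch.

{code:java}
import org.apache.hadoop.mapred.JobConf;

public class TypedBytesJobSetup {
  public static JobConf configure() {
    // Hypothetical sketch: map output and reduce input configured independently.
    JobConf conf = new JobConf();
    conf.setBoolean("stream.map.output.typed.bytes", true);    // mapper emits typed bytes
    conf.setBoolean("stream.reduce.input.typed.bytes", false); // reducer reads the toString() form
    return conf;
  }
}
{code}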
Any other comments?
> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
> Key: HADOOP-1722
> URL: https://issues.apache.org/jira/browse/HADOOP-1722
> Project: Hadoop Core
> Issue Type: Improvement
> Components: contrib/streaming
> Reporter: Runping Qi
> Assignee: Christopher Zimmerman
> Attachments: HADOOP-1722-v2.patch, HADOOP-1722.patch
>
>
> Right now, the streaming framework expects the outputs of the stream process
> (mapper or reducer) to be line-oriented UTF-8 text. This limit makes it
> impossible to use programs whose outputs may be non-UTF-8 (international
> encodings, or maybe even binary data). Streaming can overcome this limit by
> introducing a simple encoding protocol. For example, it could allow the
> mapper/reducer to hex-encode its keys/values, and the framework would decode
> them on the Java side. This way, as long as the mapper/reducer executables
> follow this encoding protocol, they can output arbitrary byte arrays and the
> streaming framework can handle them.
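
As a rough illustration of the hex-encoding idea in the quoted description, the Java-side decoding could be as simple as the following sketch (the class and method names are hypothetical):

{code:java}
// Hypothetical helper illustrating the hex-encoding protocol from the issue
// description: the executable emits hex digits, and the framework decodes
// them back into the original (possibly non-UTF-8) bytes.
public class HexDecodeSketch {
  public static byte[] decodeHex(String hex) {
    byte[] bytes = new byte[hex.length() / 2];
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return bytes;
  }
}
{code}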