[ https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Klaas Bosteels updated HADOOP-1722:
-----------------------------------

    Attachment: HADOOP-1722-v4.patch

Version 4 of the patch moves everything that was previously added to core into 
streaming, as suggested by Devaraj. 

Some comments:
* Since the typed bytes classes are still in the package 
{{org.apache.hadoop.typedbytes}} (and not in 
{{org.apache.hadoop.streaming.typedbytes}} or the like), we can still move them 
to core later without breaking sequence files that rely on 
{{TypedBytesWritable}}.
* I extended the streaming command-line format from "hadoop jar <streaming.jar> 
<options>" to "hadoop jar <streaming.jar> <command> <options>". This is 
backwards compatible because the command "streamjob" is assumed when no command 
is given explicitly, and it allowed me to add the commands "dumptb" and 
"loadtb": "dumptb" corresponds to the {{DumpTypedBytes}} class that used to be 
in tools, and "loadtb" is a new command that does (more or less) the reverse 
operation, namely, it reads typed bytes from stdin and writes them to a 
sequence file on the DFS (see the sketch after this list).
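
To illustrate what "loadtb" does, here is a rough sketch. It assumes that 
{{TypedBytesInput}} offers a {{read()}} method returning the next object 
({{null}} at end of input) and that {{TypedBytesWritable}} offers 
{{setValue(Object)}}; these names are assumptions for the sake of illustration, 
not necessarily the exact API in the attached patch:

{code:java}
// Sketch of the "loadtb" operation: read typed bytes key/value pairs from
// stdin and append them to a sequence file on the DFS.
import java.io.DataInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.typedbytes.TypedBytesInput;
import org.apache.hadoop.typedbytes.TypedBytesWritable;

public class LoadTypedBytesSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(conf);
    // The sequence file uses TypedBytesWritable for both keys and values.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, TypedBytesWritable.class, TypedBytesWritable.class);
    // Typed bytes arrive on stdin as alternating keys and values.
    TypedBytesInput in = new TypedBytesInput(new DataInputStream(System.in));
    TypedBytesWritable key = new TypedBytesWritable();
    TypedBytesWritable value = new TypedBytesWritable();
    Object k = in.read();
    while (k != null) {
      key.setValue(k);
      value.setValue(in.read());
      writer.append(key, value);
      k = in.read();
    }
    writer.close();
  }
}
{code}

The corresponding invocation would then look something like 
{{hadoop jar <streaming.jar> loadtb <path on DFS> < data.tb}} (the file name is 
made up, of course), and "dumptb" goes the other way.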

> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
>                 Key: HADOOP-1722
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1722
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Runping Qi
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-1722-v2.patch, HADOOP-1722-v3.patch, 
> HADOOP-1722-v4.patch, HADOOP-1722-v4.patch, HADOOP-1722.patch
>
>
> Right now, the streaming framework expects the outputs of the stream process 
> (mapper or reducer) to be line-oriented UTF-8 text. This limit makes it 
> impossible to use programs whose outputs may be non-UTF-8 (international 
> encodings, or maybe even binary data). Streaming can overcome this limit by 
> introducing a simple encoding protocol. For example, it could allow the 
> mapper/reducer to hex-encode its keys/values, and the framework would then 
> decode them on the Java side.
> This way, as long as the mapper/reducer executables follow this encoding 
> protocol, they can output arbitrary byte arrays and the streaming framework 
> can handle them.
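
For reference, a minimal sketch of the hex-encoding idea described above 
(purely illustrative, with made-up names; the attached patches take the typed 
bytes route instead of hex encoding):

{code:java}
// Purely illustrative: hex-encode arbitrary bytes so that they survive the
// line-oriented UTF-8 streaming protocol, and decode them again on the Java
// side. Class and method names are made up for this sketch.
public class HexCodec {

  /** Encode raw bytes as a lowercase hex string (safe to emit as a text line). */
  public static String encode(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  /** Decode a hex string back into the original bytes. */
  public static byte[] decode(String hex) {
    byte[] bytes = new byte[hex.length() / 2];
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return bytes;
  }
}
{code}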

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
