[ https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670258#action_12670258 ]

Devaraj Das commented on HADOOP-1722:
-------------------------------------

Looks good overall! The one thing that should be considered here is moving the 
typedBytes package from core to streaming. Is there currently a use case where 
typedBytes might be used elsewhere? The same argument holds for DumpTypedBytes 
& AutoInputFormat as well. Could we have DumpTypedBytes integrated with 
StreamJob (as in, if someone wants to test things out, they use the 
streaming.jar and pass an option to invoke the DumpTypedBytes tool)?
The other thing is the special handling for the basic types, as opposed to 
using raw bytes for everything. How typical is the use case where the 
key/value types are basic types? I understand that it makes the on-disk/wire 
representation compact in the cases where the native types are used, but it 
would simplify the framework if we dealt only with raw bytes (and 
probably used compression).
It would help if you included an example of a streaming app that consumes or 
produces binary data. 
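To make the compactness trade-off concrete, here is a rough sketch (Python, purely illustrative; the type code and length-prefix layout below are assumptions for the sake of comparison, not the patch's actual wire format) of encoding a basic type as a typed value versus as an opaque raw-bytes payload:

```python
import struct

# Hypothetical typed encoding: a one-byte type code followed by the value
# in a fixed-width native representation. TYPE_INT = 3 is an assumed
# constant for illustration only.
TYPE_INT = 3

def encode_typed_int(n):
    # 1 code byte + 4 payload bytes = 5 bytes total
    return struct.pack(">bi", TYPE_INT, n)

def encode_raw(payload):
    # Raw-bytes alternative: a length-prefixed opaque payload, leaving
    # interpretation entirely to the application.
    return struct.pack(">i", len(payload)) + payload

typed = encode_typed_int(42)
raw = encode_raw(str(42).encode("utf-8"))
print(len(typed), len(raw))  # typed: 5 bytes; length-prefixed decimal string: 6 bytes
```

The gap grows with the magnitude of the value (a large long stays 9 bytes typed, while its decimal string keeps growing), which is the compactness argument; the counter-argument above is that a raw-bytes-only framework is simpler and compression can recover much of the difference.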

> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
>                 Key: HADOOP-1722
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1722
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Runping Qi
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-1722-v2.patch, HADOOP-1722-v3.patch, 
> HADOOP-1722-v4.patch, HADOOP-1722.patch
>
>
> Right now, the streaming framework expects the outputs of the stream process 
> (mapper or reducer) to be line-oriented UTF-8 text. This limit makes it 
> impossible to use programs whose outputs may be non-UTF-8 
> (international encodings, or maybe even binary data). Streaming can overcome 
> this limit by introducing a simple encoding protocol. For example, it can 
> allow the mapper/reducer to hex-encode its keys/values, 
> and the framework decodes them on the Java side.
> This way, as long as the mapper/reducer executables follow this encoding 
> protocol, 
> they can output arbitrary byte arrays and the streaming framework can handle 
> them.
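The hex-encoding idea quoted above can be sketched as follows (Python chosen for illustration; the helper names `emit` and `decode_field` are hypothetical, not part of streaming). The mapper hex-encodes each key and value so its stdout remains line-oriented, tab-separated ASCII, and the framework would apply the inverse decoding on the Java side:

```python
import binascii
import sys

def emit(key, value, out=sys.stdout):
    # Hex-encode arbitrary byte arrays so the line-oriented,
    # tab-separated streaming contract is never violated.
    out.write(binascii.hexlify(key).decode("ascii"))
    out.write("\t")
    out.write(binascii.hexlify(value).decode("ascii"))
    out.write("\n")

def decode_field(field):
    # What the framework would do on the Java side:
    # recover the original bytes from the hex text.
    return binascii.unhexlify(field)

# Example: a key/value pair containing bytes that are not valid UTF-8.
emit(b"\xff\xfe", b"\x00binary\x01")
```

Any reversible text encoding (hex, Base64, or the typed bytes format discussed in the comment) would satisfy the same contract; hex roughly doubles the payload size, which is one reason a binary framing like typed bytes is attractive.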

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.