[jira] Commented: (HADOOP-1722) Make streaming to handle non-utf8 byte array

Klaas Bosteels (JIRA) Thu, 29 Jan 2009 01:26:23 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668382#action_12668382 ]


Klaas Bosteels commented on HADOOP-1722:
----------------------------------------

Replies to Runping's questions:

* As I said, letting the mapper output typed bytes while the reducer takes 
text as input will probably not be used much in practice, but in my opinion 
that does not mean we should remove the option.
* You can always set the properties directly via 
{{-D stream.map.input.typed.bytes=false}} etc., so it is still possible to let 
only the reducer output typed bytes. The {{-typedbytes}} command line option 
just provides shorthands for the most common combinations. If you are going to 
output typed bytes in the reducer, you might as well output typed bytes in the 
mapper too: that will be faster, and probably also more convenient from a 
programming perspective because types are preserved. It therefore seemed 
better to me to let the {{-typedbytes output}} shorthand correspond to using 
typed bytes for everything except the map input. Moreover, the old 
implementation of {{-typedbytes output}} would produce a sequence file 
containing {{Text}} objects when combined with {{-numReduceTasks 0}} and 
{{-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat}}, which 
seems counterintuitive to me. When the programmer asks for typed bytes as 
output, all output sequence files should contain {{TypedBytesWritable}} 
objects, as is always the case with the modified implementation of 
{{-typedbytes output}}.
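
To make the mapping concrete, here is a minimal Python sketch of what the 
{{-typedbytes output}} shorthand amounts to when written out as individual 
{{-D}} settings. Only {{stream.map.input.typed.bytes}} is named in this 
comment; the other three property names are assumed by analogy and may not 
match the patch, and the helper functions are hypothetical.

```python
# Only stream.map.input.typed.bytes is named in the comment above. The other
# three property names are ASSUMED by analogy and may differ in the patch.
STAGES = ("map.input", "map.output", "reduce.input", "reduce.output")


def property_name(stage):
    """Build the (partly assumed) property name for one streaming stage."""
    return "stream.%s.typed.bytes" % stage


def typedbytes_output_shorthand():
    """Settings described for `-typedbytes output`: typed bytes for
    everything except the map input."""
    return {property_name(s): (s != "map.input") for s in STAGES}


def reducer_output_only():
    """The combination that needs explicit -D settings: only the reducer
    output uses typed bytes."""
    return {property_name(s): (s == "reduce.output") for s in STAGES}


def as_d_flags(settings):
    """Render settings as a flat list of `-D name=value` arguments."""
    flags = []
    for stage in STAGES:
        name = property_name(stage)
        flags += ["-D", "%s=%s" % (name, str(settings[name]).lower())]
    return flags


if __name__ == "__main__":
    print(" ".join(as_d_flags(typedbytes_output_shorthand())))
    print(" ".join(as_d_flags(reducer_output_only())))
```

The second helper shows the less common case that the shorthands do not 
cover and that has to be spelled out property by property.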

Note also that all of this does not really matter that much. Since text gets 
converted to a typed bytes string, most people will be using typed bytes for 
everything in practice. The {{-typedbytes input|output|mapper|reducer}} options 
are mostly intended to make it possible to convert existing streaming programs 
gradually...
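
As an illustration of "text gets converted to a typed bytes string", the 
following Python sketch encodes a string and a raw byte array as typed bytes. 
It assumes the layout documented for Hadoop's typedbytes package as later 
released (a one-byte type code, then a 4-byte big-endian length, then the 
payload; code 7 for UTF-8 strings and code 0 for raw bytes). The layout in 
the patches attached to this issue may differ, and the function names are 
hypothetical.

```python
import struct

# Type codes ASSUMED from the typedbytes package documentation.
TYPE_BYTES = 0   # raw byte array
TYPE_STRING = 7  # UTF-8 encoded string


def encode_string(text):
    """Encode text as: type code 7, 4-byte big-endian length, UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack(">Bi", TYPE_STRING, len(data)) + data


def encode_bytes(data):
    """Encode a raw byte array as: type code 0, 4-byte length, payload."""
    return struct.pack(">Bi", TYPE_BYTES, len(data)) + data


def decode(buf):
    """Decode one string or byte-array value from the start of buf."""
    code, length = struct.unpack(">Bi", buf[:5])
    payload = buf[5:5 + length]
    if code == TYPE_STRING:
        return payload.decode("utf-8")
    if code == TYPE_BYTES:
        return payload
    raise ValueError("unsupported type code: %d" % code)
```

Because the type code travels with each value, a program reading the stream 
can tell a string from an arbitrary byte array, which is what lets types be 
preserved between the mapper and the reducer.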

> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
>                 Key: HADOOP-1722
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1722
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Runping Qi
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-1722-v2.patch, HADOOP-1722-v3.patch, 
> HADOOP-1722-v4.patch, HADOOP-1722.patch
>
>
> Right now, the streaming framework expects the output of the streaming 
> process (mapper or reducer) to be line-oriented UTF-8 text. This limit makes 
> it impossible to use programs whose output may be non-UTF-8 (international 
> encodings, or maybe even binary data). Streaming can overcome this limit by 
> introducing a simple encoding protocol. For example, it can allow the 
> mapper/reducer to hex-encode its keys/values, which the framework then 
> decodes on the Java side. This way, as long as the mapper/reducer 
> executables follow this encoding protocol, they can output arbitrary byte 
> arrays and the streaming framework can handle them.
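
The hex-encoding idea in the description can be sketched in Python as 
follows. This is only an illustration of the proposal, not what the attached 
patches implement: the tab-separated line format and both function names are 
assumptions, and in the proposal the decoding would happen on the Java side 
(it is shown here only to demonstrate the round trip).

```python
import binascii


def encode_record(key, value):
    """Hex-encode raw key/value bytes into one tab-separated text line.

    The result is plain ASCII, so it survives a line-oriented UTF-8 pipe
    even when key or value contain tabs, newlines, or non-UTF-8 bytes.
    """
    hex_key = binascii.hexlify(key).decode("ascii")
    hex_value = binascii.hexlify(value).decode("ascii")
    return hex_key + "\t" + hex_value + "\n"


def decode_record(line):
    """Recover the raw key/value bytes from a line made by encode_record."""
    hex_key, hex_value = line.rstrip("\n").split("\t", 1)
    return binascii.unhexlify(hex_key), binascii.unhexlify(hex_value)
```

The cost of this scheme is that every payload doubles in size, which is one 
reason a binary format can be preferable.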

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
