[
https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Klaas Bosteels updated HADOOP-1722:
-----------------------------------
Attachment: HADOOP-1722.patch
Are there any comments on the attached patch? It basically implements an
extended version of Eric's idea of adding an option that triggers the use of
a new binary format. However, instead of a 4-byte length, it uses a 1-byte
type code (and the number of following bytes is derived from this type code).
This leads to a slightly more compact representation for basic types (e.g. a
float requires 1 + 4 bytes instead of 4 + 4 bytes), and it also solves
another important Streaming issue, namely, that all type information is lost
when everything is converted to strings.
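To illustrate the compactness argument, a 32-bit float can be serialized as a single type-code byte followed by its 4-byte IEEE 754 payload. This is only a sketch: the type-code value used below is an assumption for illustration, the authoritative values are defined by the patch's {{org.apache.hadoop.typedbytes}} package.

```python
import struct

# Hypothetical type code for a 32-bit float; the real codes are
# fixed by the org.apache.hadoop.typedbytes package in the patch.
TYPE_FLOAT = 5

def write_float(value):
    # 1 type-code byte + 4 payload bytes = 5 bytes total,
    # versus 4 (length prefix) + 4 (payload) = 8 bytes.
    return struct.pack('>Bf', TYPE_FLOAT, value)

def read_float(data):
    # The leading byte tells us the type, and hence that
    # exactly 4 payload bytes follow.
    code, value = struct.unpack('>Bf', data)
    assert code == TYPE_FLOAT
    return value

encoded = write_float(1.5)
print(len(encoded))         # 5 bytes: type code + IEEE 754 payload
print(read_float(encoded))  # 1.5 (exactly representable, so it round-trips)
```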
h5. Contents
The patch consists of the following parts:
* A new package {{org.apache.hadoop.typedbytes}} in {{src/core}} that provides
functionality for dealing with sequences of bytes in which the first byte is a
type code. This package also includes classes that can convert Writables
to/from typed bytes and (de)serialize Records to/from typed bytes. The typed
bytes format itself was kept as simple and straightforward as possible in order
to make it very easy to write conversion code in other languages.
* Changes to Streaming that add the {{-typedbytes none|input|output|all}}
option. When typed bytes are requested for the input, the functionality
provided by the package {{org.apache.hadoop.typedbytes}} is used to convert all
input Writables to typed bytes (which makes it possible to let Streaming
programs seamlessly take sequence files containing Records and/or other
Writables as input), and when typed bytes are used for the output, Streaming
outputs {{TypedBytesWritables}} (i.e. instances of the
{{org.apache.hadoop.typedbytes.TypedBytesWritable}} class, which extends
{{BytesWritable}}).
* A new tool {{DumpTypedBytes}} in {{src/tools}} that dumps DFS files as typed
bytes to stdout. This can often be a lot more convenient than printing out the
strings returned by the {{toString()}} methods, and it can also be used to
fetch an input sample from the DFS for testing Streaming programs that use
typed bytes.
* A new input format called {{AutoInputFormat}}, which can take text files as
well as sequence files (or both at the same time) as input. The functionality
to deal with text and sequence files transparently was required for the
{{DumpTypedBytes}} tool, and putting it in an input format makes sense since
the ability to take both text and sequence files as input can be very useful
for Streaming programs. Because Streaming still uses the old mapred API, the
patch includes two versions of {{AutoInputFormat}} (one for the old and another
for the new API).
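To give a feel for how easy the format is to parse in other languages, a minimal reader just dispatches on the leading type code. The sketch below is an assumption-laden illustration (the type-code values and the two types shown are made up for the example; the real definitions live in {{org.apache.hadoop.typedbytes}}):

```python
import struct
from io import BytesIO

# Illustrative type codes, assumed for this sketch only.
TYPE_INT, TYPE_STRING = 3, 7

def read_typed_value(stream):
    # The first byte gives the type, which determines how many
    # bytes follow (fixed-size for ints, length-prefixed for strings).
    code = stream.read(1)[0]
    if code == TYPE_INT:
        return struct.unpack('>i', stream.read(4))[0]
    if code == TYPE_STRING:
        length = struct.unpack('>i', stream.read(4))[0]
        return stream.read(length).decode('utf-8')
    raise ValueError('unknown type code: %d' % code)

# A stream containing the string "word" followed by the int 1:
data = BytesIO(struct.pack('>Bi', TYPE_STRING, 4) + b'word'
               + struct.pack('>Bi', TYPE_INT, 1))
print(read_typed_value(data))  # 'word'
print(read_typed_value(data))  # 1
```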
h5. Example
Using the simple Python module available at
http://github.com/klbostee/typedbytes, the mapper script
{code}
import sys
import typedbytes
input = typedbytes.PairedInput(sys.stdin)
output = typedbytes.PairedOutput(sys.stdout)
for (key, value) in input:
    for word in value.split():
        output.write((word, 1))
{code}
and the reducer script
{code}
import sys
import typedbytes
from itertools import groupby
from operator import itemgetter
input = typedbytes.PairedInput(sys.stdin)
output = typedbytes.PairedOutput(sys.stdout)
for (key, group) in groupby(input, itemgetter(0)):
    values = map(itemgetter(1), group)
    output.write((key, sum(values)))
{code}
can be used to do a simple wordcount. The unit tests include a similar example
in Java.
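The mapper and reducer logic can also be exercised locally without Hadoop or the {{typedbytes}} module by feeding (key, value) pairs through the two stages in memory. This is just a sketch of the same logic: the real scripts read and write typed bytes on stdin/stdout, and the sort between the stages stands in for the shuffle that the Streaming framework performs.

```python
from itertools import groupby
from operator import itemgetter

def mapper(pairs):
    # Same logic as the mapper script: one (word, 1) pair per word.
    for key, value in pairs:
        for word in value.split():
            yield (word, 1)

def reducer(pairs):
    # Same logic as the reducer script; the input must be sorted by
    # key, which the framework guarantees between map and reduce.
    for key, group in groupby(pairs, itemgetter(0)):
        yield (key, sum(map(itemgetter(1), group)))

mapped = sorted(mapper([(0, 'a b a'), (1, 'b a')]))
print(dict(reducer(mapped)))  # {'a': 3, 'b': 2}
```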
h5. Remark
This patch renders HADOOP-4304 mostly obsolete, since it provides all
underlying functionality required for Dumbo. If this patch gets accepted, then
future versions of Dumbo will probably only consist of Python code again and
thus be very easy to install and use, which makes adding Dumbo to contrib
less of a requirement.
> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
> Key: HADOOP-1722
> URL: https://issues.apache.org/jira/browse/HADOOP-1722
> Project: Hadoop Core
> Issue Type: Improvement
> Components: contrib/streaming
> Reporter: Runping Qi
> Assignee: Christopher Zimmerman
> Attachments: HADOOP-1722.patch
>
>
> Right now, the streaming framework expects the outputs of the stream
> process (mapper or reducer) to be line-oriented UTF-8 text. This limit
> makes it impossible to use those programs whose outputs may be non-UTF-8
> (international encodings, or maybe even binary data). Streaming can
> overcome this limit by introducing a simple encoding protocol. For example,
> it can allow the mapper/reducer to hex-encode its keys/values, and the
> framework decodes them on the Java side. This way, as long as the
> mapper/reducer executables follow this encoding protocol, they can output
> arbitrary byte arrays and the streaming framework can handle them.