[
https://issues.apache.org/jira/browse/HADOOP-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Klaas Bosteels updated HADOOP-1722:
-----------------------------------
Attachment: HADOOP-1722.patch
Are there any comments on the attached patch? It basically implements an
extended version of Eric's idea of adding an option that triggers the use of
a new binary format. However, instead of a 4-byte length, it uses a 1-byte
type code (and the number of following bytes is derived from this type code).
This leads to a slightly more compact representation for basic types (e.g. a
float requires 1 + 4 bytes instead of 4 + 4 bytes), and it also solves
another important Streaming issue, namely, that all type information is lost
when everything is converted to strings.
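To illustrate the compactness argument, a 32-bit float can be serialized as a single type-code byte followed by its 4-byte IEEE 754 payload. This is only a sketch: the type-code value used below is an assumption for illustration, the authoritative values are defined by the patch's {{org.apache.hadoop.typedbytes}} package.

```python
import struct

# Hypothetical type code for a 32-bit float; the real codes are
# fixed by the org.apache.hadoop.typedbytes package in the patch.
TYPE_FLOAT = 5

def write_float(value):
    # 1 type-code byte + 4 payload bytes = 5 bytes total,
    # versus 4 (length prefix) + 4 (payload) = 8 bytes.
    return struct.pack('>Bf', TYPE_FLOAT, value)

def read_float(data):
    # The leading byte tells us the type, and hence that
    # exactly 4 payload bytes follow.
    code, value = struct.unpack('>Bf', data)
    assert code == TYPE_FLOAT
    return value

encoded = write_float(1.5)
print(len(encoded))         # 5 bytes: type code + IEEE 754 payload
print(read_float(encoded))  # 1.5 (exactly representable, so it round-trips)
```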
h5. Contents
The patch consists of the following parts:
* A new package {{org.apache.hadoop.typedbytes}} in {{src/core}} that provides
functionality for dealing with sequences of bytes in which the first byte is a
type code. This package also includes classes that can convert Writables
to/from typed bytes and (de)serialize Records to/from typed bytes. The typed
bytes format itself was kept as simple and straightforward as possible in order
to make it very easy to write conversion code in other languages.
* Changes to Streaming that add the {{-typedbytes none|input|output|all}}
option. When typed bytes are requested for the input, the functionality
provided by the package {{org.apache.hadoop.typedbytes}} is used to convert all
input Writables to typed bytes (which makes it possible to let Streaming
programs seamlessly take sequence files containing Records and/or other
Writables as input), and when typed bytes are used for the output, Streaming
outputs {{TypedBytesWritables}} (i.e. instances of the
{{org.apache.hadoop.typedbytes.TypedBytesWritable}} class, which extends
{{BytesWritable}}).
* A new tool {{DumpTypedBytes}} in {{src/tools}} that dumps DFS files as typed
bytes to stdout. This can often be a lot more convenient than printing out the
strings returned by the {{toString()}} methods, and it can also be used to
fetch an input sample from the DFS for testing Streaming programs that use
typed bytes.
* A new input format called {{AutoInputFormat}}, which can take text files as
well as sequence files (or both at the same time) as input. The functionality
to deal with text and sequence files transparently was required for the
{{DumpTypedBytes}} tool, and putting it in an input format makes sense since
the ability to take both text and sequence files as input can be very useful
for Streaming programs. Because Streaming still uses the old mapred API, the
patch includes two versions of {{AutoInputFormat}} (one for the old and another
for the new API).
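To give a feel for how easy the format is to parse in other languages, a minimal reader just dispatches on the leading type code. The sketch below is an assumption-laden illustration (the type-code values and the two types shown are made up for the example; the real definitions live in {{org.apache.hadoop.typedbytes}}):

```python
import struct
from io import BytesIO

# Illustrative type codes, assumed for this sketch only.
TYPE_INT, TYPE_STRING = 3, 7

def read_typed_value(stream):
    # The first byte gives the type, which determines how many
    # bytes follow (fixed-size for ints, length-prefixed for strings).
    code = stream.read(1)[0]
    if code == TYPE_INT:
        return struct.unpack('>i', stream.read(4))[0]
    if code == TYPE_STRING:
        length = struct.unpack('>i', stream.read(4))[0]
        return stream.read(length).decode('utf-8')
    raise ValueError('unknown type code: %d' % code)

# A stream containing the string "word" followed by the int 1:
data = BytesIO(struct.pack('>Bi', TYPE_STRING, 4) + b'word'
               + struct.pack('>Bi', TYPE_INT, 1))
print(read_typed_value(data))  # 'word'
print(read_typed_value(data))  # 1
```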
h5. Example
Using the simple Python module available at
http://github.com/klbostee/typedbytes, the mapper script
{code}
import sys
import typedbytes
input = typedbytes.PairedInput(sys.stdin)
output = typedbytes.PairedOutput(sys.stdout)
for (key, value) in input:
    for word in value.split():
        output.write((word, 1))
{code}
and the reducer script
{code}
import sys
import typedbytes
from itertools import groupby
from operator import itemgetter
input = typedbytes.PairedInput(sys.stdin)
output = typedbytes.PairedOutput(sys.stdout)
for (key, group) in groupby(input, itemgetter(0)):
    values = map(itemgetter(1), group)
    output.write((key, sum(values)))
{code}
can be used to do a simple wordcount. The unit tests include a similar example
in Java.
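The mapper and reducer logic can also be exercised locally without Hadoop or the {{typedbytes}} module by feeding (key, value) pairs through the two stages in memory. This is just a sketch of the same logic: the real scripts read and write typed bytes on stdin/stdout, and the sort between the stages stands in for the shuffle that the Streaming framework performs.

```python
from itertools import groupby
from operator import itemgetter

def mapper(pairs):
    # Same logic as the mapper script: one (word, 1) pair per word.
    for key, value in pairs:
        for word in value.split():
            yield (word, 1)

def reducer(pairs):
    # Same logic as the reducer script; the input must be sorted by
    # key, which the framework guarantees between map and reduce.
    for key, group in groupby(pairs, itemgetter(0)):
        yield (key, sum(map(itemgetter(1), group)))

mapped = sorted(mapper([(0, 'a b a'), (1, 'b a')]))
print(dict(reducer(mapped)))  # {'a': 3, 'b': 2}
```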
h5. Remark
This patch renders HADOOP-4304 mostly obsolete, since it provides all
underlying functionality required for Dumbo. If this patch gets accepted, then
future versions of Dumbo will probably only consist of Python code again and
thus be very easy to install and use, which makes adding Dumbo to contrib
less of a requirement.
> Make streaming to handle non-utf8 byte array
> --------------------------------------------
>
> Key: HADOOP-1722
> URL: https://issues.apache.org/jira/browse/HADOOP-1722
> Project: Hadoop Core
> Issue Type: Improvement
> Components: contrib/streaming
> Reporter: Runping Qi
> Assignee: Christopher Zimmerman
> Attachments: HADOOP-1722.patch
>
>
> Right now, the streaming framework expects the outputs of the stream
> process (mapper or reducer) to be line-oriented UTF-8 text. This limit
> makes it impossible to use those programs whose outputs may be non-UTF-8
> (international encodings, or maybe even binary data). Streaming can
> overcome this limit by introducing a simple encoding protocol. For example,
> it can allow the mapper/reducer to hex-encode its keys/values, and the
> framework decodes them on the Java side. This way, as long as the
> mapper/reducer executables follow this encoding protocol, they can output
> arbitrary byte arrays and the streaming framework can handle them.