[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jay Hacker updated MAPREDUCE-5018:
----------------------------------
Attachment: MAPREDUCE-5018.patch
JustBytes patch submitted for code review.
> Support raw binary data with Hadoop streaming
> ---------------------------------------------
>
> Key: MAPREDUCE-5018
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: contrib/streaming
> Reporter: Jay Hacker
> Priority: Minor
> Attachments: MAPREDUCE-5018.patch
>
>
> People often need to run older programs over many files, and turn to
> Hadoop streaming as a reliable, performant batch system. There are good
> reasons for this:
> 1. Hadoop is convenient: they may already be using it for MapReduce jobs, and
>    it is easy to spin up a cluster in the cloud.
> 2. It is reliable: HDFS replicates data and the framework retries failed tasks.
> 3. It is reasonably performant: it moves the code to the data, maintaining
>    locality, and scales with the number of nodes.
>
> Historically, Hadoop has been oriented toward processing key/value pairs,
> and so it needs to interpret the data passing through it. Unfortunately, this
> makes it difficult to use Hadoop streaming with programs that don't deal in
> key/value pairs, or with binary data in general. For example, something as
> simple as running md5sum to verify the integrity of files will not give the
> correct result, due to Hadoop's interpretation of the data.
>
> There have been several attempts at binary serialization schemes for Hadoop
> streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed
> at efficiently encoding key/value pairs, not at passing data through
> unmodified. Even the "RawBytes" serialization scheme adds length fields to
> the data, rendering it not-so-raw.
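
As an illustration of why a length-prefixed encoding is not byte-transparent, here is a minimal Python sketch. The 4-byte big-endian length prefix is an assumption chosen for illustration, not a statement of the exact RawBytes wire format, and `frame_with_length` is a hypothetical helper, not part of Hadoop.

```python
import hashlib
import struct


def frame_with_length(payload: bytes) -> bytes:
    """Hypothetical helper: prepend an assumed 4-byte big-endian length field."""
    return struct.pack(">I", len(payload)) + payload


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


# Arbitrary binary payload containing tab, newline, and non-UTF-8 bytes.
payload = b"\x00\x01binary\tdata\nwith separators\xff"
framed = frame_with_length(payload)

# The framed stream carries four extra bytes, so a program that hashes
# its stdin sees different input and produces a different digest.
print(len(framed) - len(payload))           # 4
print(md5_hex(payload) == md5_hex(framed))  # False
```

A tool such as md5sum has no way to know that the leading bytes are framing rather than data, which is why any added field defeats the check.
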
>
> I often need to run a Unix filter on files stored in HDFS; currently,
> the only way I can do this on the raw data is to copy the data out and run
> the filter on one machine, which is inconvenient, slow, and unreliable. It
> would be very convenient to run the filter as a map-only job, allowing me to
> build on existing (well-tested!) building blocks in the Unix tradition
> instead of reimplementing them as MapReduce programs.
>
> However, most existing tools don't know about file splits, and so want to
> process whole files; and of course many expect raw binary input and output.
> The solution is to run a map-only job with an InputFormat and OutputFormat
> that just pass raw bytes and don't split. It turns out to be a little more
> complicated with streaming; I have attached a patch with the simplest
> solution I could come up with. I call the format "JustBytes" (as "RawBytes"
> was already taken), and it should be usable with most recent versions of
> Hadoop.
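
The need for unsplit input can be illustrated without Hadoop. The Python sketch below simulates input splits by cutting a byte string into fixed-size chunks; `split_bytes` and the split size of 400 are hypothetical stand-ins, not anything taken from the patch. In Hadoop itself, a common way to prevent splitting is to override `isSplitable` in a `FileInputFormat` subclass so that it returns false; whether the attached patch does exactly that is not stated in this message.

```python
import hashlib
from typing import List


def split_bytes(data: bytes, split_size: int) -> List[bytes]:
    """Hypothetical stand-in for input splits: cut data into fixed-size chunks."""
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


# 1024 bytes of arbitrary binary content standing in for one file.
data = bytes(range(256)) * 4

# What a whole-file tool computes when it sees the entire file.
whole_digest = md5_hex(data)

# What the same tool computes when each mapper sees only one split.
split_digests = [md5_hex(chunk) for chunk in split_bytes(data, 400)]

print(len(split_digests))             # 3
print(whole_digest in split_digests)  # False
```

None of the per-split digests matches the digest of the whole file, so a tool that expects complete files only gives a meaningful answer when each mapper receives one entire file.
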
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira