[
https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780917#comment-16780917
]
Ruslan Dautkhanov commented on MAPREDUCE-5018:
----------------------------------------------
Is there any workaround for this? It would be great to be able to use the
Hadoop Streaming facility with binary files.
> Support raw binary data with Hadoop streaming
> ---------------------------------------------
>
> Key: MAPREDUCE-5018
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: contrib/streaming
> Affects Versions: 1.1.2
> Reporter: Jay Hacker
> Assignee: Steven Willis
> Priority: Minor
> Labels: BB2015-05-TBR
> Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch,
> MAPREDUCE-5018.patch, justbytes.jar, mapstream
>
>
> People often have a need to run older programs over many files, and turn to
> Hadoop streaming as a reliable, performant batch system. There are good
> reasons for this:
> 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and
> it is easy to spin up a cluster in the cloud.
> 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
> 3. It is reasonably performant: it moves the code to the data, maintaining
> locality, and scales with the number of nodes.
> Historically Hadoop is of course oriented toward processing key/value pairs,
> and so needs to interpret the data passing through it. Unfortunately, this
> makes it difficult to use Hadoop streaming with programs that don't deal in
> key/value pairs, or with binary data in general. For example, something as
> simple as running md5sum to verify the integrity of files will not give the
> correct result, because streaming treats its input as newline-delimited,
> tab-separated key/value text and re-frames it on the way through, so the
> bytes the command sees are not the bytes in the file.
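> To make that concrete, here is a small self-contained Java illustration
> (hypothetical example code, not part of Hadoop or the attached patch) showing
> that line-oriented reading does not round-trip arbitrary bytes, which is
> exactly the kind of change a checksum detects:
> {code:java}
> import java.io.BufferedReader;
> import java.io.ByteArrayInputStream;
> import java.io.ByteArrayOutputStream;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import java.nio.charset.StandardCharsets;
> import java.util.Arrays;
>
> /** Illustrative only: line-oriented processing is lossy for binary data. */
> public class LossyLines {
>     public static void main(String[] args) throws IOException {
>         byte[] original = {0x00, (byte) 0xFF, '\r', '\n', 'x'}; // arbitrary binary
>         BufferedReader reader = new BufferedReader(new InputStreamReader(
>                 new ByteArrayInputStream(original), StandardCharsets.ISO_8859_1));
>         ByteArrayOutputStream copy = new ByteArrayOutputStream();
>         String line;
>         while ((line = reader.readLine()) != null) {   // strips \r and \n
>             copy.write(line.getBytes(StandardCharsets.ISO_8859_1));
>             copy.write('\n');                          // re-adds only \n
>         }
>         // Prints false: the bytes differ, so md5sum over them would differ too.
>         System.out.println(Arrays.equals(original, copy.toByteArray()));
>     }
> }
> {code}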
> There have been several attempts at binary serialization schemes for Hadoop
> streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed
> at efficiently encoding key/value pairs, and not passing data through
> unmodified. Even the "RawBytes" serialization scheme adds length fields to
> the data, rendering it not-so-raw.
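> As a concrete sketch of that framing (illustrative, not the actual streaming
> source), each RawBytes record goes over the pipe as a 4-byte length followed
> by the payload, so a downstream filter never sees the file's bytes alone:
> {code:java}
> import java.io.DataOutputStream;
> import java.io.IOException;
>
> /** Sketch of RawBytes-style per-record framing. */
> public class RawBytesFraming {
>     public static void writeRecord(DataOutputStream out, byte[] payload)
>             throws IOException {
>         out.writeInt(payload.length); // length prefix: this is why it is "not-so-raw"
>         out.write(payload);           // only then the actual bytes
>     }
> }
> {code}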
> I often have a need to run a Unix filter on files stored in HDFS; currently,
> the only way I can do this on the raw data is to copy the data out and run
> the filter on one machine, which is inconvenient, slow, and unreliable. It
> would be very convenient to run the filter as a map-only job, allowing me to
> build on existing (well-tested!) building blocks in the Unix tradition
> instead of reimplementing them as mapreduce programs.
> However, most existing tools don't know about file splits, and so want to
> process whole files; and of course many expect raw binary input and output.
> The solution is to run a map-only job with an InputFormat and OutputFormat
> that just pass raw bytes and don't split. It turns out to be a little more
> complicated with streaming; I have attached a patch with the simplest
> solution I could come up with. I call the format "JustBytes" (as "RawBytes"
> was already taken), and it should be usable with most recent versions of
> Hadoop.
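> For reference, a minimal sketch of the non-splitting, whole-file idea (class
> and method names here are hypothetical; see the attached patch for the real
> JustBytes code), written against the old org.apache.hadoop.mapred API that
> streaming builds on:
> {code:java}
> import java.io.IOException;
>
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.IOUtils;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.mapred.FileInputFormat;
> import org.apache.hadoop.mapred.FileSplit;
> import org.apache.hadoop.mapred.InputSplit;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.RecordReader;
> import org.apache.hadoop.mapred.Reporter;
>
> /** Never splits; emits each file's bytes as one untouched record. */
> public class WholeFileInputFormat
>         extends FileInputFormat<NullWritable, BytesWritable> {
>
>     @Override
>     protected boolean isSplitable(FileSystem fs, Path file) {
>         return false; // one mapper per file, so tools see complete files
>     }
>
>     @Override
>     public RecordReader<NullWritable, BytesWritable> getRecordReader(
>             InputSplit split, JobConf conf, Reporter reporter) throws IOException {
>         return new WholeFileRecordReader((FileSplit) split, conf);
>     }
>
>     static class WholeFileRecordReader
>             implements RecordReader<NullWritable, BytesWritable> {
>         private final FileSplit split;
>         private final JobConf conf;
>         private boolean done = false;
>
>         WholeFileRecordReader(FileSplit split, JobConf conf) {
>             this.split = split;
>             this.conf = conf;
>         }
>
>         @Override
>         public boolean next(NullWritable key, BytesWritable value) throws IOException {
>             if (done) return false;
>             // Assumes each file fits in memory; fine for the illustration.
>             byte[] contents = new byte[(int) split.getLength()];
>             Path file = split.getPath();
>             FSDataInputStream in = file.getFileSystem(conf).open(file);
>             try {
>                 IOUtils.readFully(in, contents, 0, contents.length);
>             } finally {
>                 IOUtils.closeStream(in);
>             }
>             value.set(contents, 0, contents.length);
>             done = true;
>             return true;
>         }
>
>         @Override public NullWritable createKey() { return NullWritable.get(); }
>         @Override public BytesWritable createValue() { return new BytesWritable(); }
>         @Override public long getPos() { return done ? split.getLength() : 0; }
>         @Override public float getProgress() { return done ? 1.0f : 0.0f; }
>         @Override public void close() {}
>     }
> }
> {code}
> Note that an input format alone is not enough: streaming's own key/value
> serialization still frames the bytes on the way to the mapper's stdin, which
> is the extra complication the attached patch deals with.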