[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13654863#comment-13654863 ]
Jay Hacker commented on MAPREDUCE-5018:
---------------------------------------

[~pratem], you're right, there are cases where it's not efficient. Consider this, though: if you have 100 TB of files in HDFS that you want to md5sum (or what have you), would you rather do an inefficient distributed md5sum on the cluster, or copy 100 TB out to a single machine and wait for a single md5sum? Can you even fit that on one machine? You still gain reliability: there are multiple copies of each file, and failed jobs get restarted. It's also just convenient.

Here's the trick to make it efficient: use many files, and set the block size of each individual file big enough to hold the whole file:

{{hadoop fs -D dfs.block.size=1073741824 -put ...}}

Then all reads are local, and you get all the performance Hadoop can give you. (A sketch automating this per-file block sizing appears at the end of this message.)

> Support raw binary data with Hadoop streaming
> ---------------------------------------------
>
>                 Key: MAPREDUCE-5018
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/streaming
>            Reporter: Jay Hacker
>            Priority: Minor
>         Attachments: justbytes.jar, MAPREDUCE-5018.patch, mapstream
>
> People often have a need to run older programs over many files, and they turn to Hadoop streaming as a reliable, performant batch system. There are good reasons for this:
> 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and it is easy to spin up a cluster in the cloud.
> 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
> 3. It is reasonably performant: it moves the code to the data, maintaining locality, and scales with the number of nodes.
> Hadoop is of course historically oriented toward processing key/value pairs, and so needs to interpret the data passing through it. Unfortunately, this makes it difficult to use Hadoop streaming with programs that don't deal in key/value pairs, or with binary data in general. For example, something as simple as running md5sum to verify the integrity of files will not give the correct result, due to Hadoop's interpretation of the data.
> There have been several attempts at binary serialization schemes for Hadoop streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed at efficiently encoding key/value pairs, not at passing data through unmodified. Even the "RawBytes" serialization scheme adds length fields to the data, rendering it not so raw.
> I often need to run a Unix filter on files stored in HDFS; currently, the only way I can do this on the raw data is to copy the data out and run the filter on one machine, which is inconvenient, slow, and unreliable. It would be very convenient to run the filter as a map-only job, letting me build on existing (well-tested!) building blocks in the Unix tradition instead of reimplementing them as mapreduce programs.
> However, most existing tools don't know about file splits, and so want to process whole files; and of course many expect raw binary input and output. The solution is to run a map-only job with an InputFormat and OutputFormat that just pass raw bytes and don't split. It turns out to be a little more complicated with streaming; I have attached a patch with the simplest solution I could come up with. I call the format "JustBytes" (as "RawBytes" was already taken), and it should be usable with most recent versions of Hadoop.
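For anyone wanting to see the shape of the solution: the non-splitting, raw-bytes input format the description talks about boils down to something like the sketch below. To be clear, this is not the attached patch; the class name, the NullWritable/BytesWritable key/value choice, and the structure are illustrative guesses, and the streaming glue (getting the bytes to the subprocess's stdin without any added framing) is the part the actual patch has to solve. It uses the old "mapred" API that streaming is built on.

{code:java}
// Sketch only: a minimal non-splitting, whole-file input format in the
// spirit of the JustBytes patch. Names and types are illustrative, not
// the attachment's actual code.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // one map task per file; binary data is never split
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, final JobConf job, Reporter reporter)
      throws IOException {
    final FileSplit fileSplit = (FileSplit) split;
    return new RecordReader<NullWritable, BytesWritable>() {
      private boolean done = false;

      // Emit exactly one record: the file's entire contents, unmodified.
      public boolean next(NullWritable key, BytesWritable value)
          throws IOException {
        if (done) {
          return false;
        }
        Path path = fileSplit.getPath();
        FileSystem fs = path.getFileSystem(job);
        byte[] contents = new byte[(int) fileSplit.getLength()];
        FSDataInputStream in = fs.open(path);
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        value.set(contents, 0, contents.length);
        done = true;
        return true;
      }

      public NullWritable createKey() { return NullWritable.get(); }
      public BytesWritable createValue() { return new BytesWritable(); }
      public long getPos() { return done ? fileSplit.getLength() : 0; }
      public float getProgress() { return done ? 1.0f : 0.0f; }
      public void close() {}
    };
  }
}
{code}

Note that this sketch reads each file entirely into memory, so it assumes every file fits in a map task's heap; combined with the block-size trick above, that single read is also node-local.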
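And for completeness, Jay's block-size trick amounts to the following, automated per file. Again a sketch only: the class name and the round-up-to-the-next-megabyte policy are made up here, but the FileSystem calls are standard; any whole-megabyte block size satisfies the requirement that dfs.block.size be a multiple of the checksum chunk size (io.bytes.per.checksum, 512 bytes by default).

{code:java}
// Sketch only: copy each local file into HDFS with a per-file block size
// just large enough to hold it, so the whole file lives in one block.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PutWithWholeFileBlocks {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem local = FileSystem.getLocal(conf);
    FileSystem hdfs = FileSystem.get(conf);
    Path destDir = new Path(args[args.length - 1]);  // last arg: HDFS dir

    for (int i = 0; i < args.length - 1; i++) {
      Path src = new Path(args[i]);
      long size = local.getFileStatus(src).getLen();
      // Round the block size up to the next megabyte (minimum 1 MB) so
      // the whole file fits in a single block and map reads stay local.
      long blockSize =
          Math.max(1L << 20, ((size + (1L << 20) - 1) >> 20) << 20);
      FSDataOutputStream out = hdfs.create(new Path(destDir, src.getName()),
          true, 4096, hdfs.getDefaultReplication(), blockSize);
      IOUtils.copyBytes(local.open(src), out, conf, true);  // closes both
    }
  }
}
{code}

Plain {{hadoop fs -D dfs.block.size=N -put}} does the same thing one file at a time; the only thing the program adds is computing a right-sized N for each file.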