[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032514#comment-14032514 ]

Hadoop QA commented on MAPREDUCE-5018:
--------------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644886/MAPREDUCE-5018.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test file.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.
        See 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4662//artifact/trunk/patchprocess/diffJavadocWarnings.txt
 for details.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core 
hadoop-tools/hadoop-streaming.

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4662//testReport/
Console output: 
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4662//console

This message is automatically generated.

> Support raw binary data with Hadoop streaming
> ---------------------------------------------
>
>                 Key: MAPREDUCE-5018
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/streaming
>    Affects Versions: trunk, 1.1.2
>            Reporter: Jay Hacker
>            Assignee: Steven Willis
>            Priority: Minor
>         Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, 
> MAPREDUCE-5018.patch, justbytes.jar, mapstream
>
>
> People often need to run older programs over many files, and turn to 
> Hadoop streaming as a reliable, performant batch system.  There are good 
> reasons for this:
> 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and 
> it is easy to spin up a cluster in the cloud.
> 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
> 3. It is reasonably performant: it moves the code to the data, maintaining 
> locality, and scales with the number of nodes.
> Hadoop is of course historically oriented toward processing key/value pairs, 
> and so needs to interpret the data passing through it.  Unfortunately, this 
> makes it difficult to use Hadoop streaming with programs that don't deal in 
> key/value pairs, or with binary data in general.  For example, something as 
> simple as running md5sum to verify the integrity of files will not give the 
> correct result, due to Hadoop's interpretation of the data.  
> There have been several attempts at binary serialization schemes for Hadoop 
> streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
> at efficiently encoding key/value pairs, and not passing data through 
> unmodified.  Even the "RawBytes" serialization scheme adds length fields to 
> the data, rendering it not-so-raw.
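>
> As an illustration of that last point, here is a minimal sketch (not code 
> from any Hadoop release; the class name is made up for the example) of the 
> kind of length-prefixed framing a RawBytes-style serialization performs, 
> and why the framed stream no longer matches the input byte-for-byte:
> {code:java}
> import java.io.ByteArrayOutputStream;
> import java.io.DataOutputStream;
> import java.io.IOException;
>
> public class FramingDemo {
>     // RawBytes-style framing: each record is preceded by a 4-byte length,
>     // so the concatenated output is not identical to the raw input.
>     static void writeFramed(DataOutputStream out, byte[] record) throws IOException {
>         out.writeInt(record.length);  // 4 extra bytes per record
>         out.write(record);
>     }
>
>     public static void main(String[] args) throws IOException {
>         ByteArrayOutputStream buf = new ByteArrayOutputStream();
>         DataOutputStream out = new DataOutputStream(buf);
>         byte[] data = "hello".getBytes("UTF-8");
>         writeFramed(out, data);
>         // 5 input bytes become 9 output bytes, so e.g. md5sum over the
>         // framed stream no longer matches md5sum over the original data.
>         System.out.println("input=" + data.length + " bytes, framed=" + buf.size() + " bytes");
>     }
> }
> {code}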
> I often need to run a Unix filter on files stored in HDFS; currently, the 
> only way I can do this on the raw data is to copy it out and run the 
> filter on one machine, which is inconvenient, slow, and unreliable.  It 
> would be very convenient to run the filter as a map-only job, allowing me to 
> build on existing (well-tested!) building blocks in the Unix tradition 
> instead of reimplementing them as mapreduce programs.
> However, most existing tools don't know about file splits, and so want to 
> process whole files; and of course many expect raw binary input and output.  
> The solution is to run a map-only job with an InputFormat and OutputFormat 
> that just pass raw bytes and don't split.  It turns out to be a little more 
> complicated with streaming; I have attached a patch with the simplest 
> solution I could come up with.  I call the format "JustBytes" (as "RawBytes" 
> was already taken), and it should be usable with most recent versions of 
> Hadoop.
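>
> For concreteness, here is a minimal sketch of such a non-splitting, 
> whole-file input format (hypothetical class name; the real implementation 
> is in the attached patch, and this sketch assumes each file fits in a 
> byte array):
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.IOUtils;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.mapreduce.InputSplit;
> import org.apache.hadoop.mapreduce.JobContext;
> import org.apache.hadoop.mapreduce.RecordReader;
> import org.apache.hadoop.mapreduce.TaskAttemptContext;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>
> public class WholeFileInputFormat
>         extends FileInputFormat<NullWritable, BytesWritable> {
>
>     @Override
>     protected boolean isSplitable(JobContext context, Path file) {
>         return false;  // never split: each map task sees one whole file
>     }
>
>     @Override
>     public RecordReader<NullWritable, BytesWritable> createRecordReader(
>             InputSplit split, TaskAttemptContext context) {
>         return new RecordReader<NullWritable, BytesWritable>() {
>             private boolean done = false;
>             private final BytesWritable value = new BytesWritable();
>
>             @Override
>             public void initialize(InputSplit s, TaskAttemptContext ctx)
>                     throws IOException {
>                 // Read the entire (unsplit) file into one value, unmodified.
>                 FileSplit fileSplit = (FileSplit) s;
>                 Path path = fileSplit.getPath();
>                 FileSystem fs = path.getFileSystem(ctx.getConfiguration());
>                 byte[] bytes = new byte[(int) fileSplit.getLength()];
>                 FSDataInputStream in = fs.open(path);
>                 try {
>                     IOUtils.readFully(in, bytes, 0, bytes.length);
>                 } finally {
>                     in.close();
>                 }
>                 value.set(bytes, 0, bytes.length);
>             }
>
>             @Override
>             public boolean nextKeyValue() {
>                 if (done) return false;
>                 done = true;
>                 return true;  // exactly one record: the whole file
>             }
>
>             @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
>             @Override public BytesWritable getCurrentValue() { return value; }
>             @Override public float getProgress() { return done ? 1.0f : 0.0f; }
>             @Override public void close() { }
>         };
>     }
> }
> {code}
> A matching output format would write each value's bytes with no key, 
> separator, or length prefix; combined with a map-only job (zero reduces), 
> that gives byte-for-byte passthrough around the streaming command.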



--
This message was sent by Atlassian JIRA
(v6.2#6252)
