To quote the docs:

---
This feature can be used when map/reduce tasks crash deterministically on certain input. This happens due to bugs in the map/reduce function. The usual course would be to fix these bugs. But sometimes that is not possible; perhaps the bug is in a third-party library for which the source code is not available. In such cases the task never reaches completion even with multiple attempts, and the complete data for that task is lost.

With this feature, only a small portion of data surrounding the bad record is lost, which may be acceptable for some user applications. See SkipBadRecords.setMapperMaxSkipRecords(Configuration, long).
---

Basically, it's a heavy-handed approach that you should only use as a last resort.
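If you do try it with streaming, note that the framework locates the bad record by watching a processed-record counter, which is why the command below sets mapreduce.map.skip.proc.count.autoincr=false: with streaming the framework can't increment that counter for you, so the mapper has to report it itself via the reporter:counter stderr protocol. A rough sketch, assuming the group and counter names that SkipBadRecords uses ("SkippingTaskCounters" and "MapProcessedRecords"):

import sys

for line in sys.stdin:
    # ... process the record and emit key\tvalue output here ...

    # After each successfully handled record, tell the framework it
    # was processed; if the task crashes, this counter lets the
    # framework pin down which record to skip on the next attempt.
    sys.stderr.write(
        "reporter:counter:SkippingTaskCounters,MapProcessedRecords,1\n")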

Daniel


On 4/13/17 3:24 PM, Pillis W wrote:
Thanks Daniel.

Please correct me if I have misunderstood, but according to the documentation at http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Skipping_Bad_Records , it seems the sole purpose of this feature is to tolerate unknown failures/exceptions in mappers/reducers. If I were able to catch all failures myself, I would not need to use this feature at all - is that not true?

If I have understood it incorrectly, when would one use the feature to skip bad records?

Regards,
PW




On Thu, Apr 13, 2017 at 2:49 PM, Daniel Templeton <[email protected]> wrote:

    You have to modify wordcount-mapper-t1.py to just ignore the bad
    line.  In the worst case, you should be able to do something like:

    import sys

    for line in sys.stdin:
      try:
        # Insert processing code here
        pass
      except Exception:
        # Error processing the record; ignore it and move on
        pass
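    For example, a complete word-count mapper built on that pattern
    might look like this (a sketch; wordcount-mapper-t1.py presumably
    does something similar):

    import sys

    for line in sys.stdin:
      try:
        for word in line.split():
          sys.stdout.write("%s\t1\n" % word)
      except Exception:
        # Bad record: swallow the error and move on
        pass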

    Daniel


    On 4/13/17 1:33 PM, Pillis W wrote:

        Hello,
        I am using 'hadoop-streaming.jar' to do a simple word count, and
        want to skip records that fail execution. Below is the actual
        command I run; the mapper always fails on one record and hence
        fails the job. The input file is 3 lines with 1 bad line.

        hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
          -D mapred.job.name=SkipTest \
          -D mapreduce.task.skip.start.attempts=1 \
          -D mapreduce.map.skip.maxrecords=1 \
          -D mapreduce.reduce.skip.maxgroups=1 \
          -D mapreduce.map.skip.proc.count.autoincr=false \
          -D mapreduce.reduce.skip.proc.count.autoincr=false \
          -D mapred.reduce.tasks=1 \
          -D mapred.map.tasks=1 \
          -files /home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py \
          -input /user/hadoop/data/test1 \
          -output /user/hadoop/data/output-test-5 \
          -mapper "python wordcount-mapper-t1.py" \
          -reducer "python wordcount-reducer-t1.py"


        I was wondering: is skipping of records supported when MapReduce
        is used in streaming mode?

        Thanks in advance.
        PW


