To quote the docs:

---
This feature can be used when map/reduce tasks crash deterministically on certain input. This happens due to bugs in the map/reduce function. The usual course would be to fix these bugs. But sometimes that is not possible; perhaps the bug is in a third-party library for which the source code is not available. In such cases the task never reaches completion even with multiple attempts, and the complete data for that task is lost.

With this feature, only a small portion of data surrounding the bad record is lost, which may be acceptable for some user applications. See SkipBadRecords.setMapperMaxSkipRecords(Configuration, long).
---

Basically, it's a heavy-handed approach that you should only use as a last resort.
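If you do try it with streaming, note that the framework locates the bad record by watching a processed-record counter, which is why the command below sets mapreduce.map.skip.proc.count.autoincr=false: with streaming the framework can't increment that counter for you, so the mapper has to report it itself via the reporter:counter stderr protocol. A rough sketch, assuming the group and counter names that SkipBadRecords uses ("SkippingTaskCounters" and "MapProcessedRecords"):

import sys

for line in sys.stdin:
    # ... process the record and emit key\tvalue output here ...

    # After each successfully handled record, tell the framework it
    # was processed; if the task crashes, this counter lets the
    # framework pin down which record to skip on the next attempt.
    sys.stderr.write(
        "reporter:counter:SkippingTaskCounters,MapProcessedRecords,1\n")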

Daniel


On 4/13/17 3:24 PM, Pillis W wrote:
Thanks Daniel.

Please correct me if I have misunderstood, but according to the documentation at http://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Skipping_Bad_Records , it seems the sole purpose of this feature is to tolerate unknown failures/exceptions in mappers/reducers. If I were able to catch all failures myself, I would not need to use this feature at all - is that not true?

If I have understood it incorrectly, when would one use the feature to skip bad records?

Regards,
PW




On Thu, Apr 13, 2017 at 2:49 PM, Daniel Templeton <[email protected]> wrote:

    You have to modify wordcount-mapper-t1.py to just ignore the bad
    line.  In the worst case, you should be able to do something like:

    import sys

    for line in sys.stdin:
      try:
        # Insert processing code here
        pass
      except Exception:
        # Error processing the record; ignore it and move on
        pass
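    For example, a complete word-count mapper built on that pattern
    might look like this (a sketch; wordcount-mapper-t1.py presumably
    does something similar):

    import sys

    for line in sys.stdin:
      try:
        for word in line.split():
          sys.stdout.write("%s\t1\n" % word)
      except Exception:
        # Bad record: swallow the error and move on
        pass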

    Daniel


    On 4/13/17 1:33 PM, Pillis W wrote:

        Hello,
        I am using 'hadoop-streaming.jar' to do a simple word count, and
        want to skip records that fail execution. Below is the actual
        command I run; the mapper always fails on one record and hence
        fails the job. The input file is 3 lines with 1 bad line.

        hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
          -D mapred.job.name=SkipTest \
          -D mapreduce.task.skip.start.attempts=1 \
          -D mapreduce.map.skip.maxrecords=1 \
          -D mapreduce.reduce.skip.maxgroups=1 \
          -D mapreduce.map.skip.proc.count.autoincr=false \
          -D mapreduce.reduce.skip.proc.count.autoincr=false \
          -D mapred.reduce.tasks=1 \
          -D mapred.map.tasks=1 \
          -files /home/hadoop/wc/wordcount-mapper-t1.py,/home/hadoop/wc/wordcount-reducer-t1.py \
          -input /user/hadoop/data/test1 \
          -output /user/hadoop/data/output-test-5 \
          -mapper "python wordcount-mapper-t1.py" \
          -reducer "python wordcount-reducer-t1.py"


        I was wondering: is skipping of records supported when MapReduce
        is used in streaming mode?

        Thanks in advance.
        PW


