Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

bmdevelopment Fri, 09 Jul 2010 02:08:38 -0700

Hi, I updated to the version here:
http://github.com/kevinweil/hadoop-lzo


However, when I use lzop for intermediate compression I am still
having trouble: the reduce phase now freezes at 99% and eventually fails.
It's not an immediate problem for me, because I can use the default codec,
but it may be of concern to someone else.
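
For anyone who wants the same fallback, here is a minimal sketch of the
relevant mapred-site.xml entries. It assumes the 0.20-era property names:
mapred.compress.map.output keeps intermediate compression switched on, and
the codec is pointed at the stock org.apache.hadoop.io.compress.DefaultCodec
instead of LzopCodec. Treat it as an illustration, not a copy of the
configuration linked elsewhere in this thread.

```xml
<!-- Assumed: map output compression stays enabled -->
<property>
   <name>mapred.compress.map.output</name>
   <value>true</value>
</property>
<!-- Fall back from LzopCodec to the default (zlib-based) codec -->
<property>
   <name>mapred.map.output.compression.codec</name>
   <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```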

Thanks

On Fri, Jul 9, 2010 at 1:54 PM, Ted Yu <[email protected]> wrote:
> I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically
> mention this potential issue so that other people can avoid this problem.
> Feel free to add more onto it.
>
> On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment <[email protected]>
> wrote:
>>
>> Thanks everyone.
>>
>> Yes, using the Google Code version referenced on the wiki:
>> http://wiki.apache.org/hadoop/UsingLzoCompression
>>
>> I will try the latest version and see if that fixes the problem.
>> http://github.com/kevinweil/hadoop-lzo
>>
>> Thanks
>>
>> On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <[email protected]> wrote:
>> > On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <[email protected]> wrote:
>> >>
>> >> Todd fixed a bug where LZO header or block header data may fall on a read
>> >> boundary:
>> >>
>> >>
>> >> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>> >>
>> >>
>> >> I am wondering if that is related to the issue you saw.
>> >
>> > I don't think this bug would show up in intermediate output compression,
>> > but it's certainly possible. There have been a number of bugs fixed in LZO
>> > over on github - are you using the github version or the one from Google
>> > Code, which is out of date? Either mine or Kevin's repo on github should be
>> > a good version (I think we called the newest 0.3.4)
>> > -Todd
>> >
>> >>
>> >> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment
>> >> <[email protected]>
>> >> wrote:
>> >>>
>> >>> A little more on this.
>> >>>
>> >>> So, I've narrowed down the problem to using Lzop compression
>> >>> (com.hadoop.compression.lzo.LzopCodec)
>> >>> for mapred.map.output.compression.codec.
>> >>>
>> >>> <property>
>> >>>    <name>mapred.map.output.compression.codec</name>
>> >>>    <value>com.hadoop.compression.lzo.LzopCodec</value>
>> >>> </property>
>> >>>
>> >>> If I do the above, I will get the Shuffle Error.
>> >>> If I use DefaultCodec for mapred.map.output.compression.codec,
>> >>> there is no problem.
>> >>>
>> >>> Is this a known issue? Or is this a bug?
>> >>> Doesn't seem like it should be the expected behavior.
>> >>>
>> >>> I would be glad to contribute any further info on this if necessary.
>> >>> Please let me know.
>> >>>
>> >>> Thanks
>> >>>
>> >>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment
>> >>> <[email protected]>
>> >>> wrote:
>> >>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>> >>> >
>> >>> > I agree that it must be a configuration problem and so today I was
>> >>> > able to start from scratch and did a fresh install of 0.20.2 on the
>> >>> > 5 node cluster.
>> >>> >
>> >>> > I've now noticed that the error occurs when compression is enabled.
>> >>> > I've run the basic wordcount example like so:
>> >>> > http://pastebin.com/wvDMZZT0
>> >>> > and get the Shuffle Error.
>> >>> >
>> >>> > TT logs show this error:
>> >>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException:
>> >>> > Invalid header checksum: 225702cc (expected 0x2325)
>> >>> > Full logs:
>> >>> > http://pastebin.com/fVGjcGsW
>> >>> >
>> >>> > My mapred-site.xml:
>> >>> > http://pastebin.com/mQgMrKQw
>> >>> >
>> >>> > If I remove the compression config settings, the wordcount works
>> >>> > fine - no more Shuffle Error.
>> >>> > So, I have something wrong with my compression settings I imagine.
>> >>> > I'll continue looking into this to see what else I can find out.
>> >>> >
>> >>> > Thanks a million.
>> >>> >
>> >>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala
>> >>> > <[email protected]>
>> >>> > wrote:
>> >>> >> Hi,
>> >>> >>
>> >>> >> Sorry, I couldn't take a close look at the logs until now.
>> >>> >> Unfortunately, I could not see any huge difference between the
>> >>> >> success and failure cases. Can you please check whether things like
>> >>> >> basic hostname-to-IP address mapping are in place (if you have static
>> >>> >> resolution of hostnames set up)? A web search gives this as the most
>> >>> >> likely cause users have faced for this problem. Also, do the disks
>> >>> >> have enough free space? It would also be great if you could upload
>> >>> >> your Hadoop configuration information.
>> >>> >>
>> >>> >> I do think it is very likely that configuration is the actual
>> >>> >> problem, because it works in one case anyway.
>> >>> >>
>> >>> >> Thanks
>> >>> >> Hemanth
>> >>> >>
>> >>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
>> >>> >> <[email protected]> wrote:
>> >>> >>> Hello,
>> >>> >>> I still have had no luck with this over the past week,
>> >>> >>> and I even get the exact same problem on a completely different
>> >>> >>> 5 node cluster.
>> >>> >>> Is it worth opening a new issue in JIRA for this?
>> >>> >>> Thanks
>> >>> >>>
>> >>> >>>
>> >>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
>> >>> >>> <[email protected]> wrote:
>> >>> >>>> Hello,
>> >>> >>>> Thanks so much for the reply.
>> >>> >>>> See inline.
>> >>> >>>>
>> >>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
>> >>> >>>> <[email protected]> wrote:
>> >>> >>>>> Hi,
>> >>> >>>>>
>> >>> >>>>>> I've been getting the following error when trying to run a very
>> >>> >>>>>> simple MapReduce job.
>> >>> >>>>>> Map finishes without problem, but error occurs as soon as it
>> >>> >>>>>> enters Reduce phase.
>> >>> >>>>>>
>> >>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> >>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> >>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> >>> >>>>>>
>> >>> >>>>>> I am running a 5 node cluster and I believe I have all my
>> >>> >>>>>> settings correct:
>> >>> >>>>>>
>> >>> >>>>>> * ulimit -n 32768
>> >>> >>>>>> * DNS/RDNS configured properly
>> >>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> >>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>> >>> >>>>>>
>> >>> >>>>>> The program is very simple - just counts a unique string in a
>> >>> >>>>>> log file.
>> >>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>> >>> >>>>>>
>> >>> >>>>>> When I run, the job fails and I get the following output.
>> >>> >>>>>> http://pastebin.com/AhW6StEb
>> >>> >>>>>>
>> >>> >>>>>> However, it runs fine when I do *not* use substring() on the
>> >>> >>>>>> value (see map function in code above).
>> >>> >>>>>>
>> >>> >>>>>> This runs fine and completes successfully:
>> >>> >>>>>>            String str = val.toString();
>> >>> >>>>>>
>> >>> >>>>>> This causes error and fails:
>> >>> >>>>>>            String str = val.toString().substring(0,10);
>> >>> >>>>>>
>> >>> >>>>>> Please let me know if you need any further information.
>> >>> >>>>>> It would be greatly appreciated if anyone could shed some light
>> >>> >>>>>> on this problem.
>> >>> >>>>>
>> >>> >>>>> It is striking that changing the code to use a substring is
>> >>> >>>>> causing a difference. Assuming it is consistent and not a red
>> >>> >>>>> herring,
>> >>> >>>>
>> >>> >>>> Yes, this has been consistent over the last week. I was running
>> >>> >>>> 0.20.1 first and then upgraded to 0.20.2, but the results have been
>> >>> >>>> exactly the same.
>> >>> >>>>
>> >>> >>>>> can you look at the counters for the two jobs using the
>> >>> >>>>> JobTracker web UI - things like map records, bytes etc and see if
>> >>> >>>>> there is a noticeable difference ?
>> >>> >>>>
>> >>> >>>> Ok, so here is the first job using write.set(value.toString());
>> >>> >>>> having *no* errors:
>> >>> >>>> http://pastebin.com/xvy0iGwL
>> >>> >>>>
>> >>> >>>> And here is the second job using
>> >>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>> >>> >>>> http://pastebin.com/uGw6yNqv
>> >>> >>>>
>> >>> >>>> And here is yet another where I used a longer, and therefore
>> >>> >>>> unique, string, by write.set(value.toString().substring(0, 20));
>> >>> >>>> This makes every line unique, similar to the first job.
>> >>> >>>> Still fails.
>> >>> >>>> http://pastebin.com/GdQ1rp8i
>> >>> >>>>
>> >>> >>>>> Also, are the two programs being run against
>> >>> >>>>> the exact same input data ?
>> >>> >>>>
>> >>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>> >>> >>>> Using a shorter string leads to more like keys and therefore more
>> >>> >>>> combining/reducing, but going by the above it seems to fail whether
>> >>> >>>> the substring/key is entirely unique (23000 combine output records)
>> >>> >>>> or mostly the same (9 combine output records).
>> >>> >>>>
>> >>> >>>>>
>> >>> >>>>> Also, since the cluster size is small, you could also look at
>> >>> >>>>> the tasktracker logs on the machines where the maps have run to
>> >>> >>>>> see if there are any failures when the reduce attempts start
>> >>> >>>>> failing.
>> >>> >>>>
>> >>> >>>> Here is the TT log from the last failed job. I do not see
>> >>> >>>> anything besides the shuffle failure, but there may be something
>> >>> >>>> I am overlooking or simply do not understand.
>> >>> >>>> http://pastebin.com/DKFTyGXg
>> >>> >>>>
>> >>> >>>> Thanks again!
>> >>> >>>>
>> >>> >>>>>
>> >>> >>>>> Thanks
>> >>> >>>>> Hemanth
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>
>
