Hi,

I updated to the version here:
http://github.com/kevinweil/hadoop-lzo

However, when I use lzop for intermediate compression I am still
having trouble - the reduce phase now freezes at 99% and eventually
fails.
No immediate problem, because I can use the default codec.
But may be of concern to someone else.

Thanks

On Fri, Jul 9, 2010 at 1:54 PM, Ted Yu <[email protected]> wrote:
> I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically
> mention this potential issue so that other people can avoid such problem.
> Feel free to add more onto it.
>
> On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment <[email protected]>
> wrote:
>>
>> Thanks everyone.
>>
>> Yes, using the Google Code version referenced on the wiki:
>> http://wiki.apache.org/hadoop/UsingLzoCompression
>>
>> I will try the latest version and see if that fixes the problem.
>> http://github.com/kevinweil/hadoop-lzo
>>
>> Thanks
>>
>> On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon <[email protected]> wrote:
>> > On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <[email protected]> wrote:
>> >>
>> >> Todd fixed a bug where LZO header or block header data may fall on read
>> >> boundary:
>> >>
>> >>
>> >> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>> >>
>> >>
>> >> I am wondering if that is related to the issue you saw.
>> >
>> > I don't think this bug would show up in intermediate output compression,
>> > but
>> > it's certainly possible. There have been a number of bugs fixed in LZO
>> > over
>> > on github - are you using the github version or the one from Google Code
>> > which is out of date? Either mine or Kevin's repo on github should be a
>> > good
>> > version (I think we called the newest 0.3.4)
>> > -Todd
>> >
>> >>
>> >> On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment
>> >> <[email protected]>
>> >> wrote:
>> >>>
>> >>> A little more on this.
>> >>>
>> >>> So, I've narrowed down the problem to using Lzop compression
>> >>> (com.hadoop.compression.lzo.LzopCodec)
>> >>> for mapred.map.output.compression.codec.
>> >>>
>> >>> <property>
>> >>> <name>mapred.map.output.compression.codec</name>
>> >>> <value>com.hadoop.compression.lzo.LzopCodec</value>
>> >>> </property>
>> >>>
>> >>> If I do the above, I will get the Shuffle Error.
>> >>> If I use DefaultCodec for mapred.map.output.compression.codec.
>> >>> there is no problem.
>> >>>
>> >>> Is this a known issue? Or is this a bug?
>> >>> Doesn't seem like it should be the expected behavior.
>> >>>
>> >>> I would be glad to contribute any further info on this if necessary.
>> >>> Please let me know.
>> >>>
>> >>> Thanks
>> >>>
>> >>> On Wed, Jul 7, 2010 at 5:02 PM, bmdevelopment
>> >>> <[email protected]>
>> >>> wrote:
>> >>> > Hi, No problems. Thanks so much for your time. Greatly appreciated.
>> >>> >
>> >>> > I agree that it must be a configuration problem and so today I was
>> >>> > able
>> >>> > to start from scratch and did a fresh install of 0.20.2 on the 5
>> >>> > node
>> >>> > cluster.
>> >>> >
>> >>> > I've now noticed that the error occurs when compression is enabled.
>> >>> > I've run the basic wordcount example as so:
>> >>> > http://pastebin.com/wvDMZZT0
>> >>> > and get the Shuffle Error.
>> >>> >
>> >>> > TT logs show this error:
>> >>> > WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException:
>> >>> > Invalid
>> >>> > header checksum: 225702cc (expected 0x2325)
>> >>> > Full logs:
>> >>> > http://pastebin.com/fVGjcGsW
>> >>> >
>> >>> > My mapred-site.xml:
>> >>> > http://pastebin.com/mQgMrKQw
>> >>> >
>> >>> > If I remove the compression config settings, the wordcount works
>> >>> > fine
>> >>> > - no more Shuffle Error.
>> >>> > So, I have something wrong with my compression settings I imagine.
>> >>> > I'll continue looking into this to see what else I can find out.
>> >>> >
>> >>> > Thanks a million.
>> >>> >
>> >>> > On Tue, Jul 6, 2010 at 5:34 PM, Hemanth Yamijala
>> >>> > <[email protected]>
>> >>> > wrote:
>> >>> >> Hi,
>> >>> >>
>> >>> >> Sorry, I couldn't take a close look at the logs until now.
>> >>> >> Unfortunately, I could not see any huge difference between the
>> >>> >> success
>> >>> >> and failure case. Can you please check if things like basic
>> >>> >> hostname -
>> >>> >> ip address mapping are in place (if you have static resolution of
>> >>> >> hostnames set up) ? A web search is giving this as the most likely
>> >>> >> cause users have faced regarding this problem. Also do the disks
>> >>> >> have
>> >>> >> enough size ? Also, it would be great if you can upload your hadoop
>> >>> >> configuration information.
>> >>> >>
>> >>> >> I do think it is very likely that configuration is the actual
>> >>> >> problem
>> >>> >> because it works in one case anyway.
>> >>> >>
>> >>> >> Thanks
>> >>> >> Hemanth
>> >>> >>
>> >>> >> On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment
>> >>> >> <[email protected]> wrote:
>> >>> >>> Hello,
>> >>> >>> I still have had no luck with this over the past week.
>> >>> >>> And even get the same exact problem on a completely different 5
>> >>> >>> node
>> >>> >>> cluster.
>> >>> >>> Is it worth opening an new issue in jira for this?
>> >>> >>> Thanks
>> >>> >>>
>> >>> >>>
>> >>> >>> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment
>> >>> >>> <[email protected]> wrote:
>> >>> >>>> Hello,
>> >>> >>>> Thanks so much for the reply.
>> >>> >>>> See inline.
>> >>> >>>>
>> >>> >>>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala
>> >>> >>>> <[email protected]> wrote:
>> >>> >>>>> Hi,
>> >>> >>>>>
>> >>> >>>>>> I've been getting the following error when trying to run a very
>> >>> >>>>>> simple
>> >>> >>>>>> MapReduce job.
>> >>> >>>>>> Map finishes without problem, but error occurs as soon as it
>> >>> >>>>>> enters
>> >>> >>>>>> Reduce phase.
>> >>> >>>>>>
>> >>> >>>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>> >>> >>>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>> >>> >>>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> >>> >>>>>>
>> >>> >>>>>> I am running a 5 node cluster and I believe I have all my
>> >>> >>>>>> settings
>> >>> >>>>>> correct:
>> >>> >>>>>>
>> >>> >>>>>> * ulimit -n 32768
>> >>> >>>>>> * DNS/RDNS configured properly
>> >>> >>>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>> >>> >>>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>> >>> >>>>>>
>> >>> >>>>>> The program is very simple - just counts a unique string in a
>> >>> >>>>>> log
>> >>> >>>>>> file.
>> >>> >>>>>> See here: http://pastebin.com/5uRG3SFL
>> >>> >>>>>>
>> >>> >>>>>> When I run, the job fails and I get the following output.
>> >>> >>>>>> http://pastebin.com/AhW6StEb
>> >>> >>>>>>
>> >>> >>>>>> However, runs fine when I do *not* use substring() on the value
>> >>> >>>>>> (see
>> >>> >>>>>> map function in code above).
>> >>> >>>>>>
>> >>> >>>>>> This runs fine and completes successfully:
>> >>> >>>>>> String str = val.toString();
>> >>> >>>>>>
>> >>> >>>>>> This causes error and fails:
>> >>> >>>>>> String str = val.toString().substring(0,10);
>> >>> >>>>>>
>> >>> >>>>>> Please let me know if you need any further information.
>> >>> >>>>>> It would be greatly appreciated if anyone could shed some light
>> >>> >>>>>> on
>> >>> >>>>>> this problem.
>> >>> >>>>>
>> >>> >>>>> It catches attention that changing the code to use a substring
>> >>> >>>>> is
>> >>> >>>>> causing a difference. Assuming it is consistent and not a red
>> >>> >>>>> herring,
>> >>> >>>>
>> >>> >>>> Yes, this has been consistent over the last week. I was running
>> >>> >>>> 0.20.1
>> >>> >>>> first and then
>> >>> >>>> upgrade to 0.20.2 but results have been exactly the same.
>> >>> >>>>
>> >>> >>>>> can you look at the counters for the two jobs using the
>> >>> >>>>> JobTracker
>> >>> >>>>> web
>> >>> >>>>> UI - things like map records, bytes etc and see if there is a
>> >>> >>>>> noticeable difference ?
>> >>> >>>>
>> >>> >>>> Ok, so here is the first job using write.set(value.toString());
>> >>> >>>> having
>> >>> >>>> *no* errors:
>> >>> >>>> http://pastebin.com/xvy0iGwL
>> >>> >>>>
>> >>> >>>> And here is the second job using
>> >>> >>>> write.set(value.toString().substring(0, 10)); that fails:
>> >>> >>>> http://pastebin.com/uGw6yNqv
>> >>> >>>>
>> >>> >>>> And here is even another where I used a longer, and therefore
>> >>> >>>> unique
>> >>> >>>> string,
>> >>> >>>> by write.set(value.toString().substring(0, 20)); This makes every
>> >>> >>>> line
>> >>> >>>> unique, similar to first job.
>> >>> >>>> Still fails.
>> >>> >>>> http://pastebin.com/GdQ1rp8i
>> >>> >>>>
>> >>> >>>>>Also, are the two programs being run against
>> >>> >>>>> the exact same input data ?
>> >>> >>>>
>> >>> >>>> Yes, exactly the same input: a single csv file with 23K lines.
>> >>> >>>> Using a shorter string leads to more like keys and therefore more
>> >>> >>>> combining/reducing, but going
>> >>> >>>> by the above it seems to fail whether the substring/key is
>> >>> >>>> entirely
>> >>> >>>> unique (23000 combine output records) or
>> >>> >>>> mostly the same (9 combine output records).
>> >>> >>>>
>> >>> >>>>>
>> >>> >>>>> Also, since the cluster size is small, you could also look at
>> >>> >>>>> the
>> >>> >>>>> tasktracker logs on the machines where the maps have run to see
>> >>> >>>>> if
>> >>> >>>>> there are any failures when the reduce attempts start failing.
>> >>> >>>>
>> >>> >>>> Here is the TT log from the last failed job. I do not see
>> >>> >>>> anything
>> >>> >>>> besides the shuffle failure, but there
>> >>> >>>> may be something I am overlooking or simply do not understand.
>> >>> >>>> http://pastebin.com/DKFTyGXg
>> >>> >>>>
>> >>> >>>> Thanks again!
>> >>> >>>>
>> >>> >>>>>
>> >>> >>>>> Thanks
>> >>> >>>>> Hemanth
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>
>
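
P.S. For anyone who finds this thread with the same Shuffle Error: the workaround mentioned at the top (falling back to the default codec for map output) might look roughly like this in mapred-site.xml. This is only a sketch for a 0.20.x install. The codec property name is the one quoted in the thread; the mapred.compress.map.output switch and the fully qualified org.apache.hadoop.io.compress.DefaultCodec class name are not spelled out anywhere above, so treat them as assumptions and check them against your own mapred-default.xml.

```xml
<!-- Sketch of the workaround: keep map output compression enabled,
     but use the built-in codec instead of LzopCodec. -->
<property>
  <!-- Assumption: this is the switch that enables map output compression on 0.20.x -->
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <!-- Assumption: fully qualified name of the "DefaultCodec" mentioned in the thread -->
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```

This only takes LZO out of the intermediate (shuffle) path. It does not explain why LzopCodec fails there, so it is a stopgap rather than a fix.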
