Hi John, I had the same issue before but never found a solution. Here is a workaround someone mentioned on this mailing list; you might give it a try:
"Seemingly abnormal temp space use by segment merger"
http://www.nabble.com/Content(source-code)-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569

Regards,
Justin

On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak <j...@beforedawnsolutions.com> wrote:
> Ok.
>
> So, an update on this item.
>
> I did start running Nutch with Hadoop; I am trying a single-node config
> just to test it out.
>
> It took forever to get all of the files into the DFS (it was just over
> 80 GB), but it is in there. So I started the SegmentMerge job, and it is
> working flawlessly, though still a little slow.
>
> Also, looking at the stats for the CPU, it sometimes goes over 20%, but
> not by much and not often. The disk is very lightly taxed; the peak was
> about 20 MB/sec, and the drives and interface are rated at 3 Gb/sec, so
> no issue there.
>
> I tried to set the map tasks to 7 and the reduce tasks to 3, but when I
> restarted everything it is still only using 2 and 1. Any ideas? I made
> that change in the hadoop-site.xml file, BTW.
>
> -John
>
>
> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>
>> John Martyniak wrote:
>>> Andrzej,
>>> I am a little embarrassed asking, but is there a setup guide for
>>> setting up Hadoop for Nutch 1.0, or is it the same process as setting
>>> up for Nutch 0.17 (which I think is the existing guide out there)?
>>
>> Basically, yes - but this guide is primarily about setting up a Hadoop
>> cluster using the Hadoop pieces distributed with Nutch. As such, these
>> instructions are already slightly outdated, so it's best simply to
>> install a clean Hadoop 0.19.1 according to the instructions on the
>> Hadoop wiki, and then build the nutch*.job file separately.
>>
>>> Also, I have Hadoop already running for some other applications not
>>> associated with Nutch; can I use the same install? I think that it is
>>> the same version that Nutch 1.0 uses. Or is it just easier to set it
>>> up using the nutch config?
>>
>> Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster of
>> the same vintage (which is 0.19.1 in Nutch 1.0). In fact, I would
>> strongly recommend this way, instead of the usual "dirty" way of
>> setting up Nutch by replicating the local build dir ;)
>>
>> Just specify the nutch*.job file like this:
>>
>>   bin/hadoop jar nutch*.job <className> <args ...>
>>
>> where className and args are the class name and arguments of one of the
>> Nutch command-line tools. You can also modify the bin/nutch script
>> slightly, so that you don't have to specify fully-qualified class
>> names.
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  || |   Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
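On John's question above about the task counts stuck at 2 and 1: in Hadoop 0.19, the number of tasks a single tasktracker runs concurrently is capped by the per-node slot maxima, not only by the job-level task counts, and the tasktracker has to be restarted for the maxima to take effect. The sketch below shows the properties that typically govern this in hadoop-site.xml; the values 7 and 3 just mirror what John tried and are not a tuning recommendation.

```xml
<!-- hadoop-site.xml: illustrative values, assuming Hadoop 0.19 property names -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>  <!-- concurrent map slots per tasktracker (default 2) -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>  <!-- concurrent reduce slots per tasktracker -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>7</value>  <!-- a hint only; the input split count can override it -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>  <!-- reduces per job (default 1, matching the "1" John saw) -->
</property>
```

After editing the file, restart the tasktracker (e.g. via stop-mapred.sh / start-mapred.sh) so the slot maxima are re-read.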
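Andrzej's `bin/hadoop jar nutch*.job <className> <args ...>` pattern might look like the following in practice. The HDFS paths here are made up for illustration, and the SegmentMerger argument layout is from memory, so check it with `bin/nutch mergesegs` before relying on it.

```shell
# Inject seed URLs into the crawldb on an existing Hadoop cluster,
# using the Nutch job file instead of a replicated build dir.
bin/hadoop jar nutch-1.0.job org.apache.nutch.crawl.Injector \
    crawl/crawldb urls

# Merge several segments into one output directory (the task this
# thread is about); paths are illustrative.
bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
    crawl/merged_segments -dir crawl/segments
```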