I tried this once, but before I knew it my log file was approaching a gig within an hour or so!
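
If you need the extra detail without the runaway log file, one option is to keep the rootLogger at INFO and raise the level only for the packages doing the work. A sketch, assuming the stock conf/log4j.properties shipped with nutch-1.0 (DRFA here stands in for whichever appender your rootLogger line already names):

  # conf/log4j.properties -- keep the default level for everything else
  log4j.rootLogger=INFO,DRFA
  # DEBUG only the segment merger and the map-reduce runner
  log4j.logger.org.apache.nutch.segment=DEBUG
  log4j.logger.org.apache.hadoop.mapred=DEBUG

The log still grows quickly during a long reduce, but far more slowly than with the rootLogger itself set to DEBUG.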
> I suggest maybe turning the debug logs on for hadoop before you do the
> next crawl... you can do this by editing log4j.properties
> and changing the rootLogger from INFO to DEBUG
>
> On Thu, Nov 5, 2009 at 12:37 AM, Andrzej Bialecki <a...@getopt.org> wrote:
>> fa...@butterflycluster.net wrote:
>>>
>>> Hi there,
>>>
>>> It seems I have some serious problems with hadoop during map-reduce for
>>> MergeSegments.
>>>
>>> I am out of ideas on this. Any suggestions will be quite welcome.
>>>
>>> Here is my setup:
>>>
>>> RAM: 4G
>>> JVM HEAP: 2G
>>> mapred.child.java.opts = 1024M
>>> hadoop-0.19.1-core.jar
>>> nutch-1.0
>>> Xen VPS.
>>>
>>> After running a recrawl a few times, I end up with one segment that is
>>> relatively large compared to the new ones last generated. Here is my
>>> segments structure when things blow up after a (5th) recrawl:
>>>
>>> segment1 = 674M (after several recrawls)
>>> segment2 = 580k (last recrawl)
>>> segment3 = 568k (last recrawl)
>>> segment4 = 584k (last recrawl)
>>> ..
>>> segment8 = 560k (last recrawl)
>>>
>>> When I run mergeSegments everything goes well until we get up to 90% of
>>> the map-reduce, and then we get a thread death; here is a stack trace:
>>>
>>> 2009-11-05 10:54:16,874 INFO [org.apache.hadoop.mapred.LocalJobRunner]
>>> reduce > reduce
>>> 2009-11-05 10:54:29,794 INFO [org.apache.hadoop.mapred.LocalJobRunner]
>>> reduce > reduce
>>> 2009-11-05 10:54:55,194 INFO [org.apache.hadoop.mapred.LocalJobRunner]
>>> reduce > reduce
>>> 2009-11-05 10:57:25,844 WARN [org.apache.hadoop.mapred.LocalJobRunner]
>>> job_local_0001
>>> java.lang.ThreadDeath
>>>     at java.lang.Thread.stop(Thread.java:715)
>>>     at org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310)
>>>     at org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315)
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1239)
>>>     at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:620)
>>>     at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:665)
>>>
>>> Any suggestions please!!!!
>>
>> This is a high-level exception that doesn't indicate the nature of the
>> original problem. Is there any other information in hadoop.log or in the
>> task logs (logs/userlogs)?
>>
>> In my experience this sort of thing happens rarely for a dataset as
>> relatively small as yours, so you are lucky ;) This could be related to a
>> number of issues: running under Xen, which imposes some limits and
>> slowdowns; a low limit on file descriptors (ulimit -n); faulty RAM; or an
>> overheated CPU ...
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|   Information Retrieval, Semantic Web
>> ___|||__||  \|  || |    Embedded Unix, System Integration
>> http://www.sigram.com   Contact: info at sigram dot com
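
On Andrzej's ulimit point: that one is quick to rule out from the shell that launches the crawl. A sketch (4096 is just an example value; the hard cap you can raise to depends on how the Xen VPS is configured):

  # show the current per-process open-file limit
  ulimit -n
  # raise the soft limit for this shell before re-running the merge
  ulimit -n 4096

Merging several segments can hold a lot of part files open at once, so a low default such as 1024 is worth checking before suspecting hardware.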