OK. Try reparsing and set a lower value for *db.max.outlinks.per.page*. I am pretty sure that you are running out of memory because of the inlinks, which are stored in RAM. Applying the patch NUTCH-702 would also help.
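For reference, lowering that value in conf/nutch-site.xml would look something like the sketch below; the value 25 is only an arbitrary illustration, not a recommended setting — tune it to your crawl:

    <!-- Sketch only: caps how many outlinks are kept per page.
         25 is an example value, not a recommendation. -->
    <property>
      <name>db.max.outlinks.per.page</name>
      <value>25</value>
    </property>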
I have modified CrawlDbReducer and added another parameter, *db.fetch.links.max*:

    switch (datum.getStatus()) {
      // collect other info
      case CrawlDatum.STATUS_LINKED:
        if (maxLinks != -1 && linked.size() >= maxLinks)
          break;

where maxLinks is a variable which I initialize from the configure() method:

    maxLinks = job.getInt("db.fetch.links.max", -1);

I have not tried *db.max.outlinks.per.page* at all, but am pretty sure that *db.fetch.links.max* works fine. There is also a parameter *db.max.inlinks*, but it affects only the LinkDbMerger.

Let us know if that fixes the problem.

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/3 Kalaimathan Mahenthiran <matha...@gmail.com>

> I can see that it's running out of RAM because before starting the
> updatedb process I have approximately 7.7 GB free on the system, and
> soon after it starts running the free RAM comes down to ~48 bytes...
>
> It is definitely clogging all the RAM space.
>
> I specified the heap size to be 9 GB in hadoop-site.xml, like below:
>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx9096m -XX:-UseGCOverheadLimit</value>
> </property>
>
> I have attached a screenshot of the jconsole view of the updatedb
> process. From jconsole I can see that the CPU is hardly being used at
> all: only 0.3-0.5%.
>
> The system I'm using should not be a limitation, because it is an AMD
> 64-bit quad-core processor with 8 GB of RAM and 1.5 TB of hard disk
> space.
>
> Thanks again for all the help
>
> On Tue, Nov 3, 2009 at 4:15 AM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
> > OK. What heap size did you specify for this job? Could it be that you
> > are running out of RAM and GCing a lot? Still, it should not take THAT
> > long.
> > Can you see some variation in the stack traces, or are they always
> > pointing at the same things?
> > The operations on the metadata take an awful lot of time, which is why
> > I did NUTCH-702; however, that does not explain why processing a
> > dataset this size takes 20 days.
> >
> > J.
> >
> > 2009/11/3 Kalaimathan Mahenthiran <matha...@gmail.com>
> >
> >> I have a lot of space left on /tmp. I don't have a separate partition
> >> for /tmp; it is just a folder. There is a lot of space left, close to
> >> 1.3 TB:
> >>
> >>           1.4T   55G  1.3T   5%  /
> >> tmpfs     3.8G     0  3.8G   0%  /lib/init/rw
> >> varrun    3.8G  120K  3.8G   1%  /var/run
> >> varlock   3.8G     0  3.8G   0%  /var/lock
> >> udev      3.8G  152K  3.8G   1%  /dev
> >> tmpfs     3.8G     0  3.8G   0%  /dev/shm
> >> lrm       3.8G  2.5M  3.8G   1%  /lib/modules/2.6.28-15-server/volatile
> >> /dev/sda5 228M   29M  187M  14%  /boot
> >> /dev/sr0  388K  388K     0 100%  /media/cdrom0
> >>
> >> I also noticed that the /tmp/hadoop-root directory is 6.8 GB.
> >>
> >> I have attached the jstack of the process that is doing the update
> >> below:
> >>
> >> 2009-11-02 19:11:54
> >> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):
> >>
> >> "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b
> >> waiting on condition [0x0000000000000000]
> >>    java.lang.Thread.State: RUNNABLE
> >>
> >> "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10
> >> tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition [0x00007f4000e97000]
> >>    java.lang.Thread.State: TIMED_WAITING (sleeping)
> >>        at java.lang.Thread.sleep(Native Method)
> >>        at org.apache.hadoop.mapred.Task$1.run(Task.java:403)
> >>        at java.lang.Thread.run(Thread.java:619)
> >>
> >> "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable [0x00007f4000f98000]
> >>    java.lang.Thread.State: RUNNABLE
> >>        at java.lang.Byte.hashCode(Byte.java:394)
> >>        at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882)
> >>        at org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78)
> >>        - locked <0x00007f47ef4d9310> (a org.apache.hadoop.io.MapWritable)
> >>        at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128)
> >>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
> >>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
> >>        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
> >>        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73)
> >>        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
> >>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
> >>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
> >>
> >> "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0
> >> runnable [0x0000000000000000]
> >>    java.lang.Thread.State: RUNNABLE
> >>
> >> "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf
> >> waiting on condition [0x0000000000000000]
> >>    java.lang.Thread.State: RUNNABLE
> >>
> >> "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce
> >> waiting on condition [0x0000000000000000]
> >>    java.lang.Thread.State: RUNNABLE
> >>
> >> "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd
> >> runnable [0x0000000000000000]
> >>    java.lang.Thread.State: RUNNABLE
> >>
> >> "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in
> >> Object.wait() [0x00007f40016f7000]
> >>    java.lang.Thread.State: WAITING (on object monitor)
> >>        at java.lang.Object.wait(Native Method)
> >>        - waiting on <0x00007f400f63e6c0> (a java.lang.ref.ReferenceQueue$Lock)
> >>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
> >>        - locked <0x00007f400f63e6c0> (a java.lang.ref.ReferenceQueue$Lock)
> >>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
> >>        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> >>
> >> "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb
> >> in Object.wait() [0x00007f40017f8000]
> >>    java.lang.Thread.State: WAITING (on object monitor)
> >>        at java.lang.Object.wait(Native Method)
> >>        - waiting on <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock)
> >>        at java.lang.Object.wait(Object.java:485)
> >>        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> >>        - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock)
> >>
> >> "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition
> >> [0x00007f49d75c2000]
> >>    java.lang.Thread.State: TIMED_WAITING (sleeping)
> >>        at java.lang.Thread.sleep(Native Method)
> >>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152)
> >>        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
> >>        at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
> >>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>        at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
> >>
> >> "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable
> >>
> >> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000 nid=0x25c6 runnable
> >>
> >> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000 nid=0x25c7 runnable
> >>
> >> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000 nid=0x25c8 runnable
> >>
> >> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000 nid=0x25c9 runnable
> >>
> >> "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1
> >> waiting on condition
> >>
> >> JNI global references: 907
> >>
> >> Any help related to this would be really helpful.
> >>
> >> On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche
> >> <lists.digitalpeb...@gmail.com> wrote:
> >> > Hi again
> >> >
> >> >> I know the process is not stuck, and the process is running, because
> >> >> I turned on the hadoop logs and I can see logs being written to
> >> >> them. I'm not sure how to check if the task is completely stuck or
> >> >> not.
> >> >
> >> > Run jps to identify the process id, then *jstack id* several times to
> >> > see if it is blocked at the same place.
> >> >
> >> > How much space do you have left on the partition where /tmp is
> >> > mounted?
> >> >
> >> > J.
> >> >
> >> >> Below is the sample log as I'm sending this email. It's been on the
> >> >> updatedb process for the last 19 days, and it has been generating
> >> >> debug logs similar to the sample below. Has anyone else had this
> >> >> same issue before?
> >> >>
> >> >> > >> >> > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group > >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group > >> >> org.apache.hadoop.mapred.Task$Counter with bundle > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding > >> >> COMBINE_OUTPUT_RECORDS > >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding > MAP_INPUT_RECORDS > >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > MAP_OUTPUT_BYTES > >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > MAP_INPUT_BYTES > >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > >> MAP_OUTPUT_RECORDS > >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding > >> >> COMBINE_INPUT_RECORDS > >> >> 2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0% > >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: > >> >> record full = true > >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; > >> >> bufend = 13893589; bufvoid = 99614720 > >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; > kvend > >> >> = 65533; length = 327680 > >> >> 2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3 > >> >> 2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map > >> output > >> >> 2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4 > >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228) > >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246) > >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246) > >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246) > >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - > >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246) > >> >> 2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted > segments > >> >> 2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last > >> >> merge-pass, with 5 segments left of total size: 1192 bytes > >> >> 2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358) > >> >> 2009-11-02 13:34:23,394 INFO mapred.TaskRunner - > >> >> Task:attempt_local_0001_m_000003_0 is done. 
And is in the process of > >> >> commiting > >> >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - > >> >> attempt_local_0001_m_000003_0 Progress/ping thread exiting since it > >> >> got interrupted > >> >> 2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - > >> >> > >> >> > >> > file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432 > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group > >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group > >> >> org.apache.hadoop.mapred.Task$Counter with bundle > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > >> >> COMBINE_OUTPUT_RECORDS > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > MAP_INPUT_RECORDS > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > MAP_OUTPUT_BYTES > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > MAP_INPUT_BYTES > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > >> MAP_OUTPUT_RECORDS > >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding > >> >> COMBINE_INPUT_RECORDS > >> >> 2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task > >> >> 'attempt_local_0001_m_000003_0' done. > >> >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 > 0:0 > >> >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration - > >> >> java.io.IOException: config(config) > >> >> at > >> >> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192) > >> >> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139) > >> >> at > >> >> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) > >> >> > >> >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to > >> >> /tmp/hadoop-root/mapred/local/localRunner/split.dta > >> >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - > >> >> attempt_local_0001_m_000004_0 Progress/ping thread started > >> >> 2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1 > >> >> 2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100 > >> >> 2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0% > >> >> > >> >> Mathan > >> >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <a...@getopt.org> > wrote: > >> >> > Kalaimathan Mahenthiran wrote: > >> >> >> > >> >> >> I forgot to add the detail... > >> >> >> > >> >> >> The segment i'm trying to do updatedb on has 1.3 millions urls > >> fetched > >> >> >> and 1.08 million urls parsed.. > >> >> >> > >> >> >> Any help related to this would be appreciated... > >> >> >> > >> >> >> > >> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran > >> >> >> <matha...@gmail.com> wrote: > >> >> >>> > >> >> >>> hi everyone > >> >> >>> > >> >> >>> I'm using nutch 1.0. I have fetched successfully and currently on > >> the > >> >> >>> updatedb process. I'm doing updatedb and its taking so long. I > don't > >> >> >>> know why its taking this long. I have a new machine with quad > core > >> >> >>> processor and 8 gb of ram. > >> >> >>> > >> >> >>> I believe this system is really good in terms of processing > power. I > >> >> >>> don't think processing power is the problem here. I noticed that > all > >> >> >>> the ram is getting using up. close to 7.7gb by the updatedb > process. > >> >> >>> The computer is becoming is really slow. 
> >> >> >>> The updatedb process has been running continually for the last
> >> >> >>> 19 days with the message "merging segment data into db". Does
> >> >> >>> anyone know why it's taking so long? Is there any configuration
> >> >> >>> setting I can change to increase the speed of the updatedb
> >> >> >>> process?
> >> >> >
> >> >> > First, this process normally takes just a few minutes, depending on
> >> >> > the hardware, and not several days - so something is wrong.
> >> >> >
> >> >> > * do you run this in "local" or pseudo-distributed mode (i.e.
> >> >> > running a real jobtracker and tasktracker)? Try the
> >> >> > pseudo-distributed mode, because then you can monitor the progress
> >> >> > in the web UI.
> >> >> >
> >> >> > * how many reduce tasks do you have? with large updates it helps if
> >> >> > you run >1 reducer, to split the final sorting.
> >> >> >
> >> >> > * if the task appears to be completely stuck, please generate a
> >> >> > thread dump (kill -SIGQUIT) and see where it's stuck. This could be
> >> >> > related to urlfilter-regex or urlnormalizer-regex - you can
> >> >> > identify whether these are problematic by removing them from the
> >> >> > config and re-running the operation.
> >> >> >
> >> >> > * minor issue - when specifying the path names of segments and
> >> >> > crawldb, do NOT append the trailing slash - it's not harmful in
> >> >> > this particular case, but you could have a nasty surprise when
> >> >> > doing e.g. copy / mv operations ...
> >> >> >
> >> >> > --
> >> >> > Best regards,
> >> >> > Andrzej Bialecki <><
> >> >> >  ___. ___ ___ ___ _ _   __________________________________
> >> >> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> >> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> >> > http://www.sigram.com  Contact: info at sigram dot com
> >> >> >
> >> >
> >> >
> >> > --
> >> > DigitalPebble Ltd
> >> > http://www.digitalpebble.com
> >> >
> >
> >
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >