Hi,

I have tried your suggestion of lowering db.max.outlinks.per.page to a smaller number, but I could not reparse the segment because it had already been parsed. I also tried modifying some other settings, such as the Java heap size and the mapred.child.java.opts value, but changing those triggered some exceptions.
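For reference, this is roughly the override I used in conf/nutch-site.xml (the value below is only an example). As far as I can tell this limit is applied when a segment is parsed (in ParseOutputFormat), which would explain why it cannot help on a segment that has already been parsed:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>50</value>
  <description>Example value only: caps the number of outlinks kept per page at parse time.</description>
</property>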
Since none of that worked, I have generated a new segment (in case something was wrong with the previous one) and am redoing the fetching process. Once that completes I will run updatedb again and see whether it works.

Mathan

On Fri, Nov 6, 2009 at 5:40 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> Hello Kalaimathan,
>
> Any luck with your updateDB? I would be curious to know if the tricks I
> suggested worked.
>
> J.
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/11/3 Julien Nioche <lists.digitalpeb...@gmail.com>
>
>> OK. Try reparsing and set a lower value to db.max.outlinks.per.page. I
>> am pretty sure that you are running out of memory because of the inlinks
>> which are stored in RAM.
>> Applying the patch NUTCH-702 would also help.
>>
>> I have modified the CrawlDbReducer and added another parameter
>> db.fetch.links.max:
>>
>>   switch (datum.getStatus()) {               // collect other info
>>     case CrawlDatum.STATUS_LINKED:
>>       if (maxLinks!=-1 && linked.size()>= maxLinks) break;
>>
>> where maxLinks is a variable which I initialize from the configure() method:
>>
>>   maxLinks = job.getInt("db.fetch.links.max", -1);
>>
>> I have not tried db.max.outlinks.per.page at all but am pretty sure that
>> db.fetch.links.max works fine.
>>
>> There is also a parameter db.max.inlinks but it affects only the
>> LinkDbMerger.
>>
>> Let us know if that fixes the problem.
>>
>> Julien
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
>>
>>
>> 2009/11/3 Kalaimathan Mahenthiran <matha...@gmail.com>
>>
>>> I can see that it is running out of RAM because before starting the
>>> updatedb process I have approximately 7.7 GB free on the system, and
>>> once it has been running for some time the free RAM comes down to
>>> ~48 bytes.
>>>
>>> It is definitely using up all the RAM.
>>>
>>> I specified the heap size to be 9 GB in hadoop-site.xml like below:
>>>
>>> <property>
>>>   <name>mapred.child.java.opts</name>
>>>   <value>-Xmx9096m -XX:-UseGCOverheadLimit</value>
>>> </property>
>>>
>>> I have attached a screenshot of the jconsole view of the updatedb
>>> process. From jconsole I can see that the CPU is hardly being used at
>>> all, only about 0.3-0.5%.
>>>
>>> The system I'm using should not be a limitation because it is an AMD
>>> 64-bit quad core processor with 8 GB of RAM and 1.5 terabytes of hard
>>> disk space.
>>>
>>> Thanks again for all the help
>>>
>>> On Tue, Nov 3, 2009 at 4:15 AM, Julien Nioche
>>> <lists.digitalpeb...@gmail.com> wrote:
>>> > OK. What heapsize did you specify for this job? Could it be that you are
>>> > running out of ram and GCing a lot? Still it should not take THAT long.
>>> > Can you see some variations in the stacktraces or are they always pointing
>>> > at the same things?
>>> > The operations on the metadata take an awful lot of time, which is why I
>>> > did NUTCH-702, however that does not explain why processing a dataset
>>> > this size takes 20 days.
>>> >
>>> > J.
>>> >
>>> > 2009/11/3 Kalaimathan Mahenthiran <matha...@gmail.com>
>>> >
>>> >> I have a lot of space left on /tmp. I don't have a separate partition
>>> >> for /tmp, just a folder called /tmp, and there is a lot of space
>>> >> left.. close to 1.3 terabytes...
>>> >> >>> >> 1.4T 55G 1.3T 5% / >>> >> tmpfs 3.8G 0 3.8G 0% /lib/init/rw >>> >> varrun 3.8G 120K 3.8G 1% /var/run >>> >> varlock 3.8G 0 3.8G 0% /var/lock >>> >> udev 3.8G 152K 3.8G 1% /dev >>> >> tmpfs 3.8G 0 3.8G 0% /dev/shm >>> >> lrm 3.8G 2.5M 3.8G 1% >>> >> /lib/modules/2.6.28-15-server/volatile >>> >> /dev/sda5 228M 29M 187M 14% /boot >>> >> /dev/sr0 388K 388K 0 100% /media/cdrom0 >>> >> >>> >> I also noticed that /tmp/hadoop-root directory is 6.8 Gb... >>> >> >>> >> I have attached the jstack of the process that is doing the update.... >>> >> below >>> >> >>> >> 2009-11-02 19:11:54 >>> >> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed >>> mode): >>> >> >>> >> "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b >>> >> waiting on condition [0x0000000000000000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> >>> >> "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10 >>> >> tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition >>> >> [0x00007f4000e97000] >>> >> java.lang.Thread.State: TIMED_WAITING (sleeping) >>> >> at java.lang.Thread.sleep(Native Method) >>> >> at org.apache.hadoop.mapred.Task$1.run(Task.java:403) >>> >> at java.lang.Thread.run(Thread.java:619) >>> >> >>> >> "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable >>> >> [0x00007f4000f98000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> at java.lang.Byte.hashCode(Byte.java:394) >>> >> at >>> >> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882) >>> >> at >>> >> >>> org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78) >>> >> - locked <0x00007f47ef4d9310> (a >>> org.apache.hadoop.io.MapWritable) >>> >> at >>> >> >>> org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128) >>> >> at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42) >>> >> at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52) >>> >> at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321) >>> >> at >>> >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73) >>> >> at >>> >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35) >>> >> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436) >>> >> at >>> >> >>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170) >>> >> >>> >> "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0 >>> >> runnable [0x0000000000000000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> >>> >> "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf >>> >> waiting on condition [0x0000000000000000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> >>> >> "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce >>> >> waiting on condition [0x0000000000000000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> >>> >> "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd >>> >> runnable [0x0000000000000000] >>> >> java.lang.Thread.State: RUNNABLE >>> >> >>> >> "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in >>> >> Object.wait() [0x00007f40016f7000] >>> >> java.lang.Thread.State: WAITING (on object monitor) >>> >> at java.lang.Object.wait(Native Method) >>> >> - waiting on <0x00007f400f63e6c0> (a >>> >> java.lang.ref.ReferenceQueue$Lock) >>> >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) >>> >> - locked <0x00007f400f63e6c0> (a >>> java.lang.ref.ReferenceQueue$Lock) >>> >> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) >>> >> at >>> 
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) >>> >> >>> >> "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb >>> >> in Object.wait() [0x00007f40017f8000] >>> >> java.lang.Thread.State: WAITING (on object monitor) >>> >> at java.lang.Object.wait(Native Method) >>> >> - waiting on <0x00007f400f63e6f8> (a >>> java.lang.ref.Reference$Lock) >>> >> at java.lang.Object.wait(Object.java:485) >>> >> at >>> java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) >>> >> - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock) >>> >> >>> >> "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition >>> >> [0x00007f49d75c2000] >>> >> java.lang.Thread.State: TIMED_WAITING (sleeping) >>> >> at java.lang.Thread.sleep(Native Method) >>> >> at >>> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152) >>> >> at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94) >>> >> at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189) >>> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >>> >> at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150) >>> >> >>> >> "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable >>> >> >>> >> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000 >>> >> nid=0x25c6 runnable >>> >> >>> >> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000 >>> >> nid=0x25c7 runnable >>> >> >>> >> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000 >>> >> nid=0x25c8 runnable >>> >> >>> >> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000 >>> >> nid=0x25c9 runnable >>> >> >>> >> "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1 >>> >> waiting on condition >>> >> >>> >> JNI global references: 907 >>> >> >>> >> >>> >> >>> >> Any help related to this would be really helpful... >>> >> >>> >> On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche >>> >> <lists.digitalpeb...@gmail.com> wrote: >>> >> > Hi again >>> >> > >>> >> > >>> >> >> i know the process is not stuck.. and the process is running because >>> i >>> >> >> turned on the hadoop logs and i can see logs being written to it... >>> >> >> I'm not sure how to check if the task is completely stuck or not... >>> >> >> >>> >> > >>> >> > run jps to identify the process id then *jstack id* several times to >>> see >>> >> if >>> >> > it is blocked at the same place >>> >> > >>> >> > how much space do you have left on the partition where /tmp is >>> mounted? >>> >> > >>> >> > J. >>> >> > >>> >> > >>> >> > >>> >> >> Below is the sample log as i'm sending this email.... Its been on >>> the >>> >> >> updatedb process for the last 19 days and the it has been generating >>> >> >> debug logs similar to this........ >>> >> >> >>> >> >> Has anyone else has this same issue before... 
>>> >> >> >>> >> >> >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group >>> >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group >>> >> >> org.apache.hadoop.mapred.Task$Counter with bundle >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding >>> >> >> COMBINE_OUTPUT_RECORDS >>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding >>> MAP_INPUT_RECORDS >>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >>> MAP_OUTPUT_BYTES >>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >>> MAP_INPUT_BYTES >>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >>> >> MAP_OUTPUT_RECORDS >>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding >>> >> >> COMBINE_INPUT_RECORDS >>> >> >> 2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0% >>> >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: >>> >> >> record full = true >>> >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; >>> >> >> bufend = 13893589; bufvoid = 99614720 >>> >> >> 2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; >>> kvend >>> >> >> = 65533; length = 327680 >>> >> >> 2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3 >>> >> >> 2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map >>> >> output >>> >> >> 2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4 >>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228) >>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246) >>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246) >>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246) >>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - >>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246) >>> >> >> 2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted >>> segments >>> >> >> 2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last >>> >> >> merge-pass, with 5 segments left of total size: 1192 bytes >>> >> >> 2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358) >>> >> >> 2009-11-02 13:34:23,394 INFO mapred.TaskRunner - >>> >> >> Task:attempt_local_0001_m_000003_0 is done. 
And is in the process of >>> >> >> commiting >>> >> >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - >>> >> >> attempt_local_0001_m_000003_0 Progress/ping thread exiting since it >>> >> >> got interrupted >>> >> >> 2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - >>> >> >> >>> >> >> >>> >> >>> file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432 >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group >>> >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group >>> >> >> org.apache.hadoop.mapred.Task$Counter with bundle >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> >> >> COMBINE_OUTPUT_RECORDS >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> MAP_INPUT_RECORDS >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> MAP_OUTPUT_BYTES >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> MAP_INPUT_BYTES >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> >> MAP_OUTPUT_RECORDS >>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding >>> >> >> COMBINE_INPUT_RECORDS >>> >> >> 2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task >>> >> >> 'attempt_local_0001_m_000003_0' done. >>> >> >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 >>> 0:0 >>> >> >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration - >>> >> >> java.io.IOException: config(config) >>> >> >> at >>> >> >> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192) >>> >> >> at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139) >>> >> >> at >>> >> >> >>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132) >>> >> >> >>> >> >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split >>> to >>> >> >> /tmp/hadoop-root/mapred/local/localRunner/split.dta >>> >> >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - >>> >> >> attempt_local_0001_m_000004_0 Progress/ping thread started >>> >> >> 2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1 >>> >> >> 2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100 >>> >> >> 2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0% >>> >> >> >>> >> >> Mathan >>> >> >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <a...@getopt.org> >>> wrote: >>> >> >> > Kalaimathan Mahenthiran wrote: >>> >> >> >> >>> >> >> >> I forgot to add the detail... >>> >> >> >> >>> >> >> >> The segment i'm trying to do updatedb on has 1.3 millions urls >>> >> fetched >>> >> >> >> and 1.08 million urls parsed.. >>> >> >> >> >>> >> >> >> Any help related to this would be appreciated... >>> >> >> >> >>> >> >> >> >>> >> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran >>> >> >> >> <matha...@gmail.com> wrote: >>> >> >> >>> >>> >> >> >>> hi everyone >>> >> >> >>> >>> >> >> >>> I'm using nutch 1.0. I have fetched successfully and currently >>> on >>> >> the >>> >> >> >>> updatedb process. I'm doing updatedb and its taking so long. I >>> don't >>> >> >> >>> know why its taking this long. I have a new machine with quad >>> core >>> >> >> >>> processor and 8 gb of ram. >>> >> >> >>> >>> >> >> >>> I believe this system is really good in terms of processing >>> power. I >>> >> >> >>> don't think processing power is the problem here. 
I noticed that >>> all >>> >> >> >>> the ram is getting using up. close to 7.7gb by the updatedb >>> process. >>> >> >> >>> The computer is becoming is really slow. >>> >> >> >>> >>> >> >> >>> The updatedb process has been running for the last 19 days >>> >> continually >>> >> >> >>> with the message merging segment data into db.. Does anyone know >>> why >>> >> >> >>> its taking so long... Is there any configuration setting i can >>> do to >>> >> >> >>> increase the speed of the updatedb process... >>> >> >> > >>> >> >> > First, this process normally takes just a few minutes, depending >>> on >>> >> the >>> >> >> > hardware, and not several days - so something is wrong. >>> >> >> > >>> >> >> > * do you run this in "local" or pseudo-distributed mode (i.e. >>> running >>> >> a >>> >> >> real >>> >> >> > jobtracker and tasktracker?) Try the pseudo-distributed mode, >>> because >>> >> >> then >>> >> >> > you can monitor the progress in the web UI. >>> >> >> > >>> >> >> > * how many reduce tasks do you have? with large updates it helps >>> if >>> >> you >>> >> >> run >>> >> >> >> 1 reducer, to split the final sorting. >>> >> >> > >>> >> >> > * if the task appears to be completely stuck, please generate a >>> thread >>> >> >> dump >>> >> >> > (kill -SIGQUIT) and see where it's stuck. This could be related to >>> >> >> > urlfilter-regex or urlnormalizer-regex - you can identify if these >>> are >>> >> >> > problematic by removing them from the config and re-running the >>> >> >> operation. >>> >> >> > >>> >> >> > * minor issue - when specifying the path names of segments and >>> >> crawldb, >>> >> >> do >>> >> >> > NOT append the trailing slash - it's not harmful in this >>> particular >>> >> case, >>> >> >> > but you could have a nasty surprise when doing e.g. copy / mv >>> >> operations >>> >> >> ... >>> >> >> > >>> >> >> > -- >>> >> >> > Best regards, >>> >> >> > Andrzej Bialecki <>< >>> >> >> > ___. ___ ___ ___ _ _ __________________________________ >>> >> >> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web >>> >> >> > ___|||__|| \| || | Embedded Unix, System Integration >>> >> >> > http://www.sigram.com Contact: info at sigram dot com >>> >> >> > >>> >> >> > >>> >> >> >>> >> > >>> >> > >>> >> > >>> >> > -- >>> >> > DigitalPebble Ltd >>> >> > http://www.digitalpebble.com >>> >> > >>> >> >>> > >>> > >>> > >>> > -- >>> > DigitalPebble Ltd >>> > http://www.digitalpebble.com >>> > >>> >> >> >> >
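PS (for the archives): my reading of the CrawlDbReducer change Julien describes above is roughly the sketch below. It is only my reconstruction of his idea against the Nutch 1.0 code, not his actual patch; the field declaration and the configure() wiring are my guesses around the two fragments he posted.

// somewhere in CrawlDbReducer:
private int maxLinks;                               // -1 means no limit

public void configure(JobConf job) {
  // new knob: cap on how many inlink entries are buffered per URL
  maxLinks = job.getInt("db.fetch.links.max", -1);
  // ... rest of the existing configure() unchanged ...
}

// inside reduce(), where inlinks are collected for the current key:
switch (datum.getStatus()) {                        // collect other info
  case CrawlDatum.STATUS_LINKED:
    // once the cap is reached, stop buffering further inlinks so that
    // pages with huge numbers of inlinks no longer have to fit in RAM
    if (maxLinks != -1 && linked.size() >= maxLinks) break;
    // ... existing code that copies the datum into the 'linked' list ...
    break;
  // ... other cases unchanged ...
}

The copy into 'linked' appears to be the CrawlDatum.set() call that shows up in the jstack output above, so capping it should directly reduce the memory pressure.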
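PS 2: Andrzej's suggestion of running more than one reducer only kicks in once a real jobtracker/tasktracker is used. The jstack above shows LocalJobRunner, and in local mode, as far as I can tell, everything runs inside the client JVM with a single reducer, so mapred.child.java.opts is never applied either; for local runs the heap seems to be governed by the NUTCH_HEAPSIZE setting used by bin/nutch. Once the job runs in pseudo-distributed mode, something like the following in hadoop-site.xml should split the reduce work (the value is just an example for a quad-core box):

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
  <description>Example value only: default number of reduce tasks per job when running with a real jobtracker.</description>
</property>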