OK. Try setting a lower value for *db.max.outlinks.per.page* and reparsing. I
am pretty sure that you are running out of memory because of the inlinks,
which are stored in RAM.
Applying the patch NUTCH-702 would also help.

I have modified the CrawlDbReducer and added another parameter,
*db.fetch.links.max*:

  switch (datum.getStatus()) {                // collect other info
      case CrawlDatum.STATUS_LINKED:
        if (maxLinks != -1 && linked.size() >= maxLinks) break;
where maxLinks is a variable which I initialize in the configure() method:

    maxLinks = job.getInt("db.fetch.links.max", -1);
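
For reference, here is roughly how the two pieces fit together — a sketch
reconstructed from memory against the Nutch 1.0 sources rather than a
verbatim diff, with the rest of reduce() elided:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDbReducer
    implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {

  private int maxLinks;
  // inlinks accumulated for the current key; this is the list that grows
  // without bound and eats the heap when a URL has very many inlinks
  private final List<CrawlDatum> linked = new ArrayList<CrawlDatum>();

  public void configure(JobConf job) {
    // -1 (the default) keeps the old, unbounded behaviour
    maxLinks = job.getInt("db.fetch.links.max", -1);
  }

  public void reduce(Text key, Iterator<CrawlDatum> values,
      OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    linked.clear();
    while (values.hasNext()) {
      CrawlDatum datum = values.next();
      switch (datum.getStatus()) {              // collect other info
        case CrawlDatum.STATUS_LINKED:
          // once the cap is reached, drop further inlinks instead of
          // accumulating them all in RAM
          if (maxLinks != -1 && linked.size() >= maxLinks) break;
          CrawlDatum link = new CrawlDatum();
          link.set(datum);   // copy, since Hadoop reuses the input object
          linked.add(link);
          break;
        // ... other statuses handled as in the stock reducer ...
      }
    }
    // ... scoring and emitting the merged CrawlDatum, unchanged ...
  }

  public void close() {}
}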

I have not tried *db.max.outlinks.per.page* at all, but I am pretty sure
that *db.fetch.links.max* works fine.
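
If you want to try it on your side, a nutch-site.xml entry along these lines
should do — the value 100 below is only an illustration, tune it to your
crawl:

<property>
  <name>db.fetch.links.max</name>
  <value>100</value>
  <description>Maximum number of inlinks (STATUS_LINKED entries) kept in
  RAM per URL by the patched CrawlDbReducer during updatedb; -1 disables
  the cap.</description>
</property>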

There is also a parameter *db.max.inlinks*, but it affects only the
LinkDbMerger.

Let us know if that fixes the problem.
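
(If you apply the same modification locally, don't forget to rebuild the job
jar — with something like *ant job* in the Nutch source tree, assuming the
stock 1.0 build file — so the patched CrawlDbReducer actually ends up on the
job's classpath.)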

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com


2009/11/3 Kalaimathan Mahenthiran <matha...@gmail.com>

> I can see that it's running out of RAM: before starting the updatedb
> process I have approximately 7.7 GB free on the system, and soon after
> it starts running the free RAM comes down to ~48 bytes...
>
> It's definitely clogging all the RAM space...
>
> I specified the heap size to be 9 GB in hadoop-site.xml, like below:
> <property>
>  <name>mapred.child.java.opts</name>
>  <value>-Xmx9096m -XX:-UseGCOverheadLimit</value>
> </property>
>
> I have attached a screenshot of the jconsole view of the updatedb
> process. From jconsole I can see that the CPU is barely being used at
> all, only around 0.3-0.5%.
>
> The system I'm using should not be a limitation: it's a 64-bit AMD
> quad-core processor with 8 GB of RAM and 1.5 terabytes of hard disk
> space...
>
> Thanks again for all the help
>
> On Tue, Nov 3, 2009 at 4:15 AM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
> > OK. What heap size did you specify for this job? Could it be that you
> > are running out of RAM and GCing a lot? Still, it should not take THAT
> > long. Can you see some variation in the stack traces, or are they
> > always pointing at the same things?
> > The operations on the metadata take an awful lot of time, which is why
> > I did NUTCH-702; however, that does not explain why processing a
> > dataset this size takes 20 days.
> >
> > J.
> >
> > 2009/11/3 Kalaimathan Mahenthiran <matha...@gmail.com>
> >
> >> I have a lot of space left on /tmp. I don't have a separate partition
> >> for /tmp, just a folder, and there is a lot of space left, close to
> >> 1.3 terabytes:
> >>
> >>                      1.4T   55G  1.3T   5% /
> >> tmpfs                 3.8G     0  3.8G   0% /lib/init/rw
> >> varrun                3.8G  120K  3.8G   1% /var/run
> >> varlock               3.8G     0  3.8G   0% /var/lock
> >> udev                  3.8G  152K  3.8G   1% /dev
> >> tmpfs                 3.8G     0  3.8G   0% /dev/shm
> >> lrm                   3.8G  2.5M  3.8G   1%
> >> /lib/modules/2.6.28-15-server/volatile
> >> /dev/sda5             228M   29M  187M  14% /boot
> >> /dev/sr0              388K  388K     0 100% /media/cdrom0
> >>
> >> I also noticed that the /tmp/hadoop-root directory is 6.8 GB...
> >>
> >> I have attached the jstack output of the process that is doing the
> >> update below:
> >>
> >> 2009-11-02 19:11:54
> >> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):
> >>
> >> "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b
> >> waiting on condition [0x0000000000000000]
> >>   java.lang.Thread.State: RUNNABLE
> >>
> >> "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10
> >> tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition
> >> [0x00007f4000e97000]
> >>   java.lang.Thread.State: TIMED_WAITING (sleeping)
> >>        at java.lang.Thread.sleep(Native Method)
> >>        at org.apache.hadoop.mapred.Task$1.run(Task.java:403)
> >>        at java.lang.Thread.run(Thread.java:619)
> >>
> >> "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable
> >> [0x00007f4000f98000]
> >>   java.lang.Thread.State: RUNNABLE
> >>        at java.lang.Byte.hashCode(Byte.java:394)
> >>        at
> >> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882)
> >>        at
> >>
> org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78)
> >>        - locked <0x00007f47ef4d9310> (a
> org.apache.hadoop.io.MapWritable)
> >>        at
> >>
> org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128)
> >>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
> >>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
> >>        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
> >>        at
> >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73)
> >>        at
> >> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
> >>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
> >>        at
> >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
> >>
> >> "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0
> >> runnable [0x0000000000000000]
> >>   java.lang.Thread.State: RUNNABLE
> >>
> >> "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf
> >> waiting on condition [0x0000000000000000]
> >>   java.lang.Thread.State: RUNNABLE
> >>
> >> "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce
> >> waiting on condition [0x0000000000000000]
> >>   java.lang.Thread.State: RUNNABLE
> >>
> >> "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd
> >> runnable [0x0000000000000000]
> >>   java.lang.Thread.State: RUNNABLE
> >>
> >> "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in
> >> Object.wait() [0x00007f40016f7000]
> >>   java.lang.Thread.State: WAITING (on object monitor)
> >>        at java.lang.Object.wait(Native Method)
> >>        - waiting on <0x00007f400f63e6c0> (a
> >> java.lang.ref.ReferenceQueue$Lock)
> >>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
> >>        - locked <0x00007f400f63e6c0> (a
> java.lang.ref.ReferenceQueue$Lock)
> >>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
> >>        at
> java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> >>
> >> "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb
> >> in Object.wait() [0x00007f40017f8000]
> >>   java.lang.Thread.State: WAITING (on object monitor)
> >>        at java.lang.Object.wait(Native Method)
> >>        - waiting on <0x00007f400f63e6f8> (a
> java.lang.ref.Reference$Lock)
> >>        at java.lang.Object.wait(Object.java:485)
> >>        at
> java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> >>        - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock)
> >>
> >> "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition
> >> [0x00007f49d75c2000]
> >>   java.lang.Thread.State: TIMED_WAITING (sleeping)
> >>        at java.lang.Thread.sleep(Native Method)
> >>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152)
> >>        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
> >>        at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
> >>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>        at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
> >>
> >> "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable
> >>
> >> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000
> >> nid=0x25c6 runnable
> >>
> >> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000
> >> nid=0x25c7 runnable
> >>
> >> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000
> >> nid=0x25c8 runnable
> >>
> >> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000
> >> nid=0x25c9 runnable
> >>
> >> "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1
> >> waiting on condition
> >>
> >> JNI global references: 907
> >>
> >>
> >>
> >> Any help with this would be really appreciated...
> >>
> >> On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche
> >> <lists.digitalpeb...@gmail.com> wrote:
> >> > Hi again
> >> >
> >> >
> >> >> I know the process is not stuck, and that it is running, because I
> >> >> turned on the Hadoop logs and I can see entries being written to
> >> >> them... I'm not sure how to check whether the task is completely
> >> >> stuck or not...
> >> >
> >> > Run jps to identify the process id, then run *jstack <pid>* several
> >> > times to see if it is blocked at the same place.
> >> >
> >> > How much space do you have left on the partition where /tmp is
> >> > mounted?
> >> >
> >> > J.
> >> >
> >> >
> >> >
> >> >> Below is a sample of the log as I'm sending this email. The job
> >> >> has been on the updatedb step for the last 19 days, and it has been
> >> >> generating debug logs similar to this the whole time.
> >> >>
> >> >> Has anyone else had this same issue before?
> >> >>
> >> >>
> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ
> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE
> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
> >> >> 2009-11-02 13:34:21,643 INFO  mapred.JobClient -  map 93% reduce 0%
> >> >> 2009-11-02 13:34:22,121 INFO  mapred.MapTask - Spilling map output: record full = true
> >> >> 2009-11-02 13:34:22,121 INFO  mapred.MapTask - bufstart = 10420198; bufend = 13893589; bufvoid = 99614720
> >> >> 2009-11-02 13:34:22,121 INFO  mapred.MapTask - kvstart = 131070; kvend = 65533; length = 327680
> >> >> 2009-11-02 13:34:22,427 INFO  mapred.MapTask - Finished spill 3
> >> >> 2009-11-02 13:34:23,301 INFO  mapred.MapTask - Starting flush of map output
> >> >> 2009-11-02 13:34:23,384 INFO  mapred.MapTask - Finished spill 4
> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228)
> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246)
> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246)
> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246)
> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246)
> >> >> 2009-11-02 13:34:23,390 INFO  mapred.Merger - Merging 5 sorted segments
> >> >> 2009-11-02 13:34:23,392 INFO  mapred.Merger - Down to the last merge-pass, with 5 segments left of total size: 1192 bytes
> >> >> 2009-11-02 13:34:23,393 INFO  mapred.MapTask - Index: (0, 354, 358)
> >> >> 2009-11-02 13:34:23,394 INFO  mapred.TaskRunner - Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting
> >> >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - attempt_local_0001_m_000003_0 Progress/ping thread exiting since it got interrupted
> >> >> 2009-11-02 13:34:23,395 INFO  mapred.LocalJobRunner - file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432
> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ
> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE
> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
> >> >> 2009-11-02 13:34:23,397 INFO  mapred.TaskRunner - Task 'attempt_local_0001_m_000003_0' done.
> >> >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 0:0
> >> >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration - java.io.IOException: config(config)
> >> >>        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192)
> >> >>        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139)
> >> >>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
> >> >>
> >> >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to /tmp/hadoop-root/mapred/local/localRunner/split.dta
> >> >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - attempt_local_0001_m_000004_0 Progress/ping thread started
> >> >> 2009-11-02 13:34:23,452 INFO  mapred.MapTask - numReduceTasks: 1
> >> >> 2009-11-02 13:34:23,453 INFO  mapred.MapTask - io.sort.mb = 100
> >> >> 2009-11-02 13:34:23,644 INFO  mapred.JobClient -  map 100% reduce 0%
> >> >>
> >> >> Mathan
> >> >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <a...@getopt.org> wrote:
> >> >> > Kalaimathan Mahenthiran wrote:
> >> >> >>
> >> >> >> I forgot to add the detail...
> >> >> >>
> >> >> >> The segment I'm trying to run updatedb on has 1.3 million URLs
> >> >> >> fetched and 1.08 million URLs parsed.
> >> >> >>
> >> >> >> Any help related to this would be appreciated...
> >> >> >>
> >> >> >>
> >> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran
> >> >> >> <matha...@gmail.com> wrote:
> >> >> >>>
> >> >> >>> hi everyone
> >> >> >>>
> >> >> >>> I'm using Nutch 1.0. I have fetched successfully and am
> >> >> >>> currently on the updatedb step. It's taking very long and I
> >> >> >>> don't know why. I have a new machine with a quad-core processor
> >> >> >>> and 8 GB of RAM.
> >> >> >>>
> >> >> >>> I believe this system is really good in terms of processing
> >> >> >>> power, so I don't think that is the problem here. I noticed
> >> >> >>> that almost all the RAM, close to 7.7 GB, is used up by the
> >> >> >>> updatedb process, and the computer is becoming really slow.
> >> >> >>>
> >> >> >>> The updatedb process has been running for the last 19 days
> >> >> >>> continually with the message "merging segment data into db".
> >> >> >>> Does anyone know why it's taking so long? Is there any
> >> >> >>> configuration setting I can use to increase the speed of the
> >> >> >>> updatedb process?
> >> >> >
> >> >> > First, this process normally takes just a few minutes, depending
> >> >> > on the hardware, and not several days - so something is wrong.
> >> >> >
> >> >> > * do you run this in "local" or pseudo-distributed mode (i.e.
> >> >> > running a real jobtracker and tasktracker)? Try the
> >> >> > pseudo-distributed mode, because then you can monitor the
> >> >> > progress in the web UI.
> >> >> >
> >> >> > * how many reduce tasks do you have? with large updates it helps
> >> >> > if you run more than 1 reducer, to split the final sorting.
> >> >> >
> >> >> > * if the task appears to be completely stuck, please generate a
> >> >> > thread dump (kill -SIGQUIT) and see where it's stuck. This could
> >> >> > be related to urlfilter-regex or urlnormalizer-regex - you can
> >> >> > identify if these are problematic by removing them from the
> >> >> > config and re-running the operation.
> >> >> >
> >> >> > * minor issue - when specifying the path names of segments and
> >> >> > crawldb, do NOT append the trailing slash - it's not harmful in
> >> >> > this particular case, but you could have a nasty surprise when
> >> >> > doing e.g. copy / mv operations ...
> >> >> >
> >> >> > --
> >> >> > Best regards,
> >> >> > Andrzej Bialecki     <><
> >> >> >  ___. ___ ___ ___ _ _   __________________________________
> >> >> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> >> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> >> > http://www.sigram.com  Contact: info at sigram dot com
> >> >> >
> >> >> >
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > DigitalPebble Ltd
> >> > http://www.digitalpebble.com
> >> >
> >>
> >
> >
> >
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
>
