Hi

I have tried your suggestion of lowering db.max.outlinks.per.page to a
smaller value, but I could not reparse the segment because it had
already been parsed... I also tried modifying some other settings, such
as the Java heap size and the mapred.child.java.opts value, but
changing them triggered some exceptions.

Therefore I have generated a new segment (in case something was wrong
with the previous one) and am redoing the fetching process. Once that
is complete I will run updatedb again and see whether it works...

Mathan
On Fri, Nov 6, 2009 at 5:40 AM, Julien Nioche
<lists.digitalpeb...@gmail.com> wrote:
> Hello Kalaimathan,
>
> Any luck with your updateDB? I would be curious to know if the tricks I
> suggested worked.
>
> J.
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/11/3 Julien Nioche <lists.digitalpeb...@gmail.com>
>
>> OK. Try reparsing and set a lower value to *db.max.outlinks.per.page*. I
>> am pretty sure that you are running out of memory because of the inlinks
>> which are stored in RAM.
>> Applying the patch NUTCH-702 would also help.
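>>
>> For instance, that property can be lowered in conf/nutch-site.xml - the
>> value below is just an illustration, not a recommendation:
>>
>> <property>
>>   <name>db.max.outlinks.per.page</name>
>>   <value>25</value>
>> </property>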
>>
>> I have modified the CrawlDbReducer and added another parameter, *db.fetch.links.max*:
>>
>>   switch (datum.getStatus()) {                // collect other info
>>       case CrawlDatum.STATUS_LINKED:
>>         if (maxLinks != -1 && linked.size() >= maxLinks) break;
>> where maxLinks is a variable which I initialize from the configure() method
>>
>>     maxLinks = job.getInt("db.fetch.links.max", -1);
>>
>> I have not tried *db.max.outlinks.per.page* at all but am pretty sure that
>> *db.fetch.links.max* works fine.
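>>
>> To make that concrete, here is a rough sketch of how the two pieces fit
>> together in CrawlDbReducer. Only the db.fetch.links.max lookup and the
>> guard itself come from the change described above; the surrounding names
>> and comments are illustrative:
>>
>>   // new field on the reducer
>>   private int maxLinks;
>>
>>   public void configure(JobConf job) {
>>     // ... existing configure() body stays as-is ...
>>     // -1 keeps the current behaviour, i.e. no cap on collected inlinks
>>     maxLinks = job.getInt("db.fetch.links.max", -1);
>>   }
>>
>>   // in reduce(), where the inlinks for the current URL are gathered:
>>   switch (datum.getStatus()) {               // collect other info
>>     case CrawlDatum.STATUS_LINKED:
>>       // once the cap is reached, skip further inlinks so the in-memory
>>       // 'linked' collection cannot grow without bound
>>       if (maxLinks != -1 && linked.size() >= maxLinks) break;
>>       // ... existing code that adds the link to 'linked' ...
>>       break;
>>   }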
>>
>> There is also a parameter *db.max.inlinks* but it affects only the
>> LinkDBMerger
>>
>> Let us know if that fixes the problem
>>
>> Julien
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
>>
>>
>> 2009/11/3 Kalaimathan Mahenthiran <matha...@gmail.com>
>>
>>> I can see that it is running out of RAM: before starting the updatedb
>>> process I have approximately 7.7 GB free on the system, and once it has
>>> been running for a while the free RAM drops to ~48 bytes...
>>>
>>> It is definitely using up all of the RAM.
>>>
>>> I specified the heap size to be roughly 9 GB in hadoop-site.xml, like below:
>>> <property>
>>>   <name>mapred.child.java.opts</name>
>>>   <value>-Xmx9096m -XX:-UseGCOverheadLimit</value>
>>> </property>
>>>
>>> I have attached a screenshot of the jconsole view of the updatedb
>>> process. From jconsole I can see that the CPU is barely being used,
>>> only around 0.3-0.5%.
>>>
>>> The system I'm using should not be a limitation: it has a 64-bit AMD
>>> quad-core processor, 8 GB of RAM, and 1.5 TB of hard disk space.
>>>
>>> Thanks again for all the help
>>>
>>> On Tue, Nov 3, 2009 at 4:15 AM, Julien Nioche
>>> <lists.digitalpeb...@gmail.com> wrote:
>>> > OK. What heap size did you specify for this job? Could it be that you are
>>> > running out of RAM and GCing a lot? Still, it should not take THAT long.
>>> > Can you see any variation in the stack traces, or are they always pointing
>>> > at the same things?
>>> > The operations on the metadata take an awful lot of time, which is why I
>>> > did NUTCH-702, but that does not explain why processing a dataset this
>>> > size takes 20 days.
>>> >
>>> > J.
>>> >
>>> > 2009/11/3 Kalaimathan Mahenthiran <matha...@gmail.com>
>>> >
>>> >> There is a lot of space left for /tmp. I don't have a separate partition
>>> >> for /tmp; it is just a directory, and the root filesystem it lives on has
>>> >> close to 1.3 TB free:
>>> >>
>>> >>                      1.4T   55G  1.3T   5% /
>>> >> tmpfs                 3.8G     0  3.8G   0% /lib/init/rw
>>> >> varrun                3.8G  120K  3.8G   1% /var/run
>>> >> varlock               3.8G     0  3.8G   0% /var/lock
>>> >> udev                  3.8G  152K  3.8G   1% /dev
>>> >> tmpfs                 3.8G     0  3.8G   0% /dev/shm
>>> >> lrm                   3.8G  2.5M  3.8G   1%
>>> >> /lib/modules/2.6.28-15-server/volatile
>>> >> /dev/sda5             228M   29M  187M  14% /boot
>>> >> /dev/sr0              388K  388K     0 100% /media/cdrom0
>>> >>
>>> >> I also noticed that the /tmp/hadoop-root directory is 6.8 GB.
>>> >>
>>> >> I have included the jstack output of the process that is doing the update below:
>>> >>
>>> >> 2009-11-02 19:11:54
>>> >> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):
>>> >>
>>> >> "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b
>>> >> waiting on condition [0x0000000000000000]
>>> >>   java.lang.Thread.State: RUNNABLE
>>> >>
>>> >> "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10
>>> >> tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition
>>> >> [0x00007f4000e97000]
>>> >>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>>> >>        at java.lang.Thread.sleep(Native Method)
>>> >>        at org.apache.hadoop.mapred.Task$1.run(Task.java:403)
>>> >>        at java.lang.Thread.run(Thread.java:619)
>>> >>
>>> >> "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable
>>> >> [0x00007f4000f98000]
>>> >>   java.lang.Thread.State: RUNNABLE
>>> >>        at java.lang.Byte.hashCode(Byte.java:394)
>>> >>        at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882)
>>> >>        at org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78)
>>> >>        - locked <0x00007f47ef4d9310> (a org.apache.hadoop.io.MapWritable)
>>> >>        at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128)
>>> >>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
>>> >>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
>>> >>        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
>>> >>        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73)
>>> >>        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
>>> >>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>>> >>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
>>> >>
>>> >> "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0
>>> >> runnable [0x0000000000000000]
>>> >>   java.lang.Thread.State: RUNNABLE
>>> >>
>>> >> "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf
>>> >> waiting on condition [0x0000000000000000]
>>> >>   java.lang.Thread.State: RUNNABLE
>>> >>
>>> >> "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce
>>> >> waiting on condition [0x0000000000000000]
>>> >>   java.lang.Thread.State: RUNNABLE
>>> >>
>>> >> "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd
>>> >> runnable [0x0000000000000000]
>>> >>   java.lang.Thread.State: RUNNABLE
>>> >>
>>> >> "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in
>>> >> Object.wait() [0x00007f40016f7000]
>>> >>   java.lang.Thread.State: WAITING (on object monitor)
>>> >>        at java.lang.Object.wait(Native Method)
>>> >>        - waiting on <0x00007f400f63e6c0> (a java.lang.ref.ReferenceQueue$Lock)
>>> >>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>>> >>        - locked <0x00007f400f63e6c0> (a java.lang.ref.ReferenceQueue$Lock)
>>> >>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>>> >>        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>>> >>
>>> >> "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb
>>> >> in Object.wait() [0x00007f40017f8000]
>>> >>   java.lang.Thread.State: WAITING (on object monitor)
>>> >>        at java.lang.Object.wait(Native Method)
>>> >>        - waiting on <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock)
>>> >>        at java.lang.Object.wait(Object.java:485)
>>> >>        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>>> >>        - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock)
>>> >>
>>> >> "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition
>>> >> [0x00007f49d75c2000]
>>> >>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>>> >>        at java.lang.Thread.sleep(Native Method)
>>> >>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152)
>>> >>        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
>>> >>        at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
>>> >>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> >>        at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
>>> >>
>>> >> "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable
>>> >>
>>> >> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000
>>> >> nid=0x25c6 runnable
>>> >>
>>> >> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000
>>> >> nid=0x25c7 runnable
>>> >>
>>> >> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000
>>> >> nid=0x25c8 runnable
>>> >>
>>> >> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000
>>> >> nid=0x25c9 runnable
>>> >>
>>> >> "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1
>>> >> waiting on condition
>>> >>
>>> >> JNI global references: 907
>>> >>
>>> >>
>>> >>
>>> >> Any help with this would be really appreciated.
>>> >>
>>> >> On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche
>>> >> <lists.digitalpeb...@gmail.com> wrote:
>>> >> > Hi again
>>> >> >
>>> >> >
>>> >> >> I know the process is not stuck, and that it is running, because I
>>> >> >> turned on the Hadoop logs and can see entries being written to them.
>>> >> >> I'm not sure how to check whether the task is completely stuck or not.
>>> >> >>
>>> >> >
>>> >> > Run jps to identify the process id, then run *jstack id* several times
>>> >> > to see whether it is blocked at the same place.
>>> >> >
>>> >> > How much space do you have left on the partition where /tmp is mounted?
>>> >> >
>>> >> > J.
>>> >> >
>>> >> >
>>> >> >
>>> >> >> Below is a sample of the log as I'm sending this email. The updatedb
>>> >> >> process has been running for the last 19 days, and it has been
>>> >> >> generating debug logs similar to this the whole time.
>>> >> >>
>>> >> >> Has anyone else had this same issue before?
>>> >> >>
>>> >> >>
>>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group
>>> >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
>>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ
>>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE
>>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group
>>> >> >> org.apache.hadoop.mapred.Task$Counter with bundle
>>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding
>>> >> >> COMBINE_OUTPUT_RECORDS
>>> >> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
>>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
>>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
>>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
>>> >> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding
>>> >> >> COMBINE_INPUT_RECORDS
>>> >> >> 2009-11-02 13:34:21,643 INFO  mapred.JobClient -  map 93% reduce 0%
>>> >> >> 2009-11-02 13:34:22,121 INFO  mapred.MapTask - Spilling map output:
>>> >> >> record full = true
>>> >> >> 2009-11-02 13:34:22,121 INFO  mapred.MapTask - bufstart = 10420198;
>>> >> >> bufend = 13893589; bufvoid = 99614720
>>> >> >> 2009-11-02 13:34:22,121 INFO  mapred.MapTask - kvstart = 131070; kvend = 65533; length = 327680
>>> >> >> 2009-11-02 13:34:22,427 INFO  mapred.MapTask - Finished spill 3
>>> >> >> 2009-11-02 13:34:23,301 INFO  mapred.MapTask - Starting flush of map output
>>> >> >> 2009-11-02 13:34:23,384 INFO  mapred.MapTask - Finished spill 4
>>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
>>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228)
>>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
>>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246)
>>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
>>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246)
>>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
>>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246)
>>> >> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
>>> >> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246)
>>> >> >> 2009-11-02 13:34:23,390 INFO  mapred.Merger - Merging 5 sorted segments
>>> >> >> 2009-11-02 13:34:23,392 INFO  mapred.Merger - Down to the last
>>> >> >> merge-pass, with 5 segments left of total size: 1192 bytes
>>> >> >> 2009-11-02 13:34:23,393 INFO  mapred.MapTask - Index: (0, 354, 358)
>>> >> >> 2009-11-02 13:34:23,394 INFO  mapred.TaskRunner -
>>> >> >> Task:attempt_local_0001_m_000003_0 is done. And is in the process of
>>> >> >> commiting
>>> >> >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner -
>>> >> >> attempt_local_0001_m_000003_0 Progress/ping thread exiting since it
>>> >> >> got interrupted
>>> >> >> 2009-11-02 13:34:23,395 INFO  mapred.LocalJobRunner -
>>> >> >> file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432
>>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group
>>> >> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
>>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ
>>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE
>>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group
>>> >> >> org.apache.hadoop.mapred.Task$Counter with bundle
>>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding
>>> >> >> COMBINE_OUTPUT_RECORDS
>>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
>>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
>>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
>>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
>>> >> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding
>>> >> >> COMBINE_INPUT_RECORDS
>>> >> >> 2009-11-02 13:34:23,397 INFO  mapred.TaskRunner - Task
>>> >> >> 'attempt_local_0001_m_000003_0' done.
>>> >> >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0 0:0
>>> >> >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration -
>>> >> >> java.io.IOException: config(config)
>>> >> >>        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192)
>>> >> >>        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139)
>>> >> >>        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
>>> >> >>
>>> >> >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to
>>> >> >> /tmp/hadoop-root/mapred/local/localRunner/split.dta
>>> >> >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner -
>>> >> >> attempt_local_0001_m_000004_0 Progress/ping thread started
>>> >> >> 2009-11-02 13:34:23,452 INFO  mapred.MapTask - numReduceTasks: 1
>>> >> >> 2009-11-02 13:34:23,453 INFO  mapred.MapTask - io.sort.mb = 100
>>> >> >> 2009-11-02 13:34:23,644 INFO  mapred.JobClient -  map 100% reduce 0%
>>> >> >>
>>> >> >> Mathan
>>> >> >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <a...@getopt.org> wrote:
>>> >> >> > Kalaimathan Mahenthiran wrote:
>>> >> >> >>
>>> >> >> >> I forgot to add a detail:
>>> >> >> >>
>>> >> >> >> The segment I'm trying to run updatedb on has 1.3 million URLs fetched
>>> >> >> >> and 1.08 million URLs parsed.
>>> >> >> >>
>>> >> >> >> Any help related to this would be appreciated.
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran
>>> >> >> >> <matha...@gmail.com> wrote:
>>> >> >> >>>
>>> >> >> >>> Hi everyone,
>>> >> >> >>>
>>> >> >> >>> I'm using Nutch 1.0. I have fetched successfully and am currently on
>>> >> >> >>> the updatedb step. I'm running updatedb and it is taking very long,
>>> >> >> >>> and I don't know why. I have a new machine with a quad-core
>>> >> >> >>> processor and 8 GB of RAM.
>>> >> >> >>>
>>> >> >> >>> I believe this system is really good in terms of processing power, so
>>> >> >> >>> I don't think processing power is the problem here. I noticed that
>>> >> >> >>> almost all of the RAM, close to 7.7 GB, is being used by the updatedb
>>> >> >> >>> process, and the computer is becoming really slow.
>>> >> >> >>>
>>> >> >> >>> The updatedb process has been running continually for the last 19 days
>>> >> >> >>> with the message "merging segment data into db". Does anyone know why
>>> >> >> >>> it is taking so long? Is there any configuration setting I can change
>>> >> >> >>> to increase the speed of the updatedb process?
>>> >> >> >
>>> >> >> > First, this process normally takes just a few minutes, depending on the
>>> >> >> > hardware, and not several days - so something is wrong.
>>> >> >> >
>>> >> >> > * do you run this in "local" or pseudo-distributed mode (i.e. running a
>>> >> >> > real jobtracker and tasktracker?) Try the pseudo-distributed mode,
>>> >> >> > because then you can monitor the progress in the web UI.
>>> >> >> >
>>> >> >> > * how many reduce tasks do you have? with large updates it helps if you
>>> >> >> > run >1 reducer, to split the final sorting (see the example after this
>>> >> >> > list).
>>> >> >> >
>>> >> >> > * if the task appears to be completely stuck, please generate a thread
>>> >> >> > dump (kill -SIGQUIT) and see where it's stuck. This could be related to
>>> >> >> > urlfilter-regex or urlnormalizer-regex - you can identify if these are
>>> >> >> > problematic by removing them from the config and re-running the
>>> >> >> > operation.
>>> >> >> >
>>> >> >> > * minor issue - when specifying the path names of segments and crawldb,
>>> >> >> > do NOT append the trailing slash - it's not harmful in this particular
>>> >> >> > case, but you could have a nasty surprise when doing e.g. copy / mv
>>> >> >> > operations ...
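>>> >> >> >
>>> >> >> > For example, the number of reducers could be raised in hadoop-site.xml as
>>> >> >> > below (the value is purely illustrative, and it only takes effect when a
>>> >> >> > real jobtracker/tasktracker is running, not in local mode):
>>> >> >> >
>>> >> >> > <property>
>>> >> >> >   <name>mapred.reduce.tasks</name>
>>> >> >> >   <value>4</value>
>>> >> >> > </property>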
>>> >> >> >
>>> >> >> > --
>>> >> >> > Best regards,
>>> >> >> > Andrzej Bialecki     <><
>>> >> >> >  ___. ___ ___ ___ _ _   __________________________________
>>> >> >> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> >> >> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> >> >> > http://www.sigram.com  Contact: info at sigram dot com
>>> >> >> >
>>> >> >> >
>>> >> >>
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > DigitalPebble Ltd
>>> >> > http://www.digitalpebble.com
>>> >> >
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > DigitalPebble Ltd
>>> > http://www.digitalpebble.com
>>> >
>>>
>>
>>
>>
>
