OK. What heap size did you specify for this job? Could it be that you are
running out of RAM and GCing a lot? Still, it should not take THAT long.
Can you see some variation in the stack traces, or are they always pointing
at the same things?
The operations on the metadata take an awful lot of time, which is why I did
NUTCH-702; however, that does not explain why processing a dataset this size
takes 20 days.
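
In case it helps: your stack trace shows the LocalJobRunner, so in local mode
the job heap comes from the bin/nutch launcher rather than from Hadoop's
config. A minimal sketch (the 4000 MB value and the paths are only examples,
not a recommendation):

```shell
# bin/nutch reads NUTCH_HEAPSIZE (in MB) and passes it to the JVM as -Xmx;
# the default is only 1000 MB, which is easily too small for a large CrawlDb.
export NUTCH_HEAPSIZE=4000
bin/nutch updatedb crawl/crawldb crawl/segments/20091001000000
```

In (pseudo-)distributed mode the equivalent knob is mapred.child.java.opts
(e.g. -Xmx2000m) in hadoop-site.xml. If the heap is left at the default, a lot
of the time could indeed be going into GC.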

J.

2009/11/3 Kalaimathan Mahenthiran <matha...@gmail.com>

> I have a lot of space left on /tmp. I don't have a separate partition
> for /tmp; it is just a directory on the root filesystem, which has
> close to 1.3 terabytes free:
>
>                      1.4T   55G  1.3T   5% /
> tmpfs                 3.8G     0  3.8G   0% /lib/init/rw
> varrun                3.8G  120K  3.8G   1% /var/run
> varlock               3.8G     0  3.8G   0% /var/lock
> udev                  3.8G  152K  3.8G   1% /dev
> tmpfs                 3.8G     0  3.8G   0% /dev/shm
> lrm                   3.8G  2.5M  3.8G   1%
> /lib/modules/2.6.28-15-server/volatile
> /dev/sda5             228M   29M  187M  14% /boot
> /dev/sr0              388K  388K     0 100% /media/cdrom0
>
> I also noticed that the /tmp/hadoop-root directory is 6.8 GB.
>
> I have pasted the jstack output of the process that is doing the update
> below:
>
> 2009-11-02 19:11:54
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):
>
> "Attach Listener" daemon prio=10 tid=0x0000000041bb1000 nid=0xd3b
> waiting on condition [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Comm thread for attempt_local_0001_r_000000_0" daemon prio=10
> tid=0x00007f3ff4002800 nid=0x6b8f waiting on condition
> [0x00007f4000e97000]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>        at java.lang.Thread.sleep(Native Method)
>        at org.apache.hadoop.mapred.Task$1.run(Task.java:403)
>        at java.lang.Thread.run(Thread.java:619)
>
> "Thread-12" prio=10 tid=0x0000000041b37800 nid=0x25f3 runnable
> [0x00007f4000f98000]
>   java.lang.Thread.State: RUNNABLE
>        at java.lang.Byte.hashCode(Byte.java:394)
>        at
> java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:882)
>        at
> org.apache.hadoop.io.AbstractMapWritable.addToMap(AbstractMapWritable.java:78)
>        - locked <0x00007f47ef4d9310> (a org.apache.hadoop.io.MapWritable)
>        at
> org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:128)
>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
>        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
>        at
> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:73)
>        at
> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>        at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
>
> "Low Memory Detector" daemon prio=10 tid=0x00007f3ffc004000 nid=0x25d0
> runnable [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x00007f3ffc001000 nid=0x25cf
> waiting on condition [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x00000000417be800 nid=0x25ce
> waiting on condition [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x00000000417bc800 nid=0x25cd
> runnable [0x0000000000000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x000000004179e000 nid=0x25cc in
> Object.wait() [0x00007f40016f7000]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        - waiting on <0x00007f400f63e6c0> (a
> java.lang.ref.ReferenceQueue$Lock)
>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>        - locked <0x00007f400f63e6c0> (a java.lang.ref.ReferenceQueue$Lock)
>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x0000000041797000 nid=0x25cb
> in Object.wait() [0x00007f40017f8000]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        - waiting on <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock)
>        at java.lang.Object.wait(Object.java:485)
>        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>        - locked <0x00007f400f63e6f8> (a java.lang.ref.Reference$Lock)
>
> "main" prio=10 tid=0x0000000041734000 nid=0x25c5 waiting on condition
> [0x00007f49d75c2000]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>        at java.lang.Thread.sleep(Native Method)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1152)
>        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:94)
>        at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:189)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:150)
>
> "VM Thread" prio=10 tid=0x0000000041790000 nid=0x25ca runnable
>
> "GC task thread#0 (ParallelGC)" prio=10 tid=0x000000004173e000
> nid=0x25c6 runnable
>
> "GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000041740000
> nid=0x25c7 runnable
>
> "GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000041742000
> nid=0x25c8 runnable
>
> "GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000041744000
> nid=0x25c9 runnable
>
> "VM Periodic Task Thread" prio=10 tid=0x00007f3ffc006800 nid=0x25d1
> waiting on condition
>
> JNI global references: 907
>
>
>
> Any help with this would be much appreciated.
>
> On Mon, Nov 2, 2009 at 3:56 PM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
> > Hi again
> >
> >
> >> I know the process is not stuck, and that it is running, because I
> >> turned on the hadoop logs and I can see entries being written to them.
> >> I'm not sure how to check whether the task is completely stuck or not.
> >>
> >
> > run jps to identify the process id then *jstack id* several times to see
> > if it is blocked at the same place
> >
> > how much space do you have left on the partition where /tmp is mounted?
> >
> > J.
> >
> >
> >
> >> Below is a sample of the log as I'm sending this email. It has been on
> >> the updatedb process for the last 19 days, generating debug logs similar
> >> to this.
> >>
> >> Has anyone else had this same issue before?
> >>
> >>
> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group
> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ
> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE
> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group
> >> org.apache.hadoop.mapred.Task$Counter with bundle
> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding
> >> COMBINE_OUTPUT_RECORDS
> >> 2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
> >> 2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding
> >> COMBINE_INPUT_RECORDS
> >> 2009-11-02 13:34:21,643 INFO  mapred.JobClient -  map 93% reduce 0%
> >> 2009-11-02 13:34:22,121 INFO  mapred.MapTask - Spilling map output:
> >> record full = true
> >> 2009-11-02 13:34:22,121 INFO  mapred.MapTask - bufstart = 10420198;
> >> bufend = 13893589; bufvoid = 99614720
> >> 2009-11-02 13:34:22,121 INFO  mapred.MapTask - kvstart = 131070; kvend
> >> = 65533; length = 327680
> >> 2009-11-02 13:34:22,427 INFO  mapred.MapTask - Finished spill 3
> >> 2009-11-02 13:34:23,301 INFO  mapred.MapTask - Starting flush of map output
> >> 2009-11-02 13:34:23,384 INFO  mapred.MapTask - Finished spill 4
> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228)
> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246)
> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246)
> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246)
> >> 2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
> >> MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246)
> >> 2009-11-02 13:34:23,390 INFO  mapred.Merger - Merging 5 sorted segments
> >> 2009-11-02 13:34:23,392 INFO  mapred.Merger - Down to the last
> >> merge-pass, with 5 segments left of total size: 1192 bytes
> >> 2009-11-02 13:34:23,393 INFO  mapred.MapTask - Index: (0, 354, 358)
> >> 2009-11-02 13:34:23,394 INFO  mapred.TaskRunner -
> >> Task:attempt_local_0001_m_000003_0 is done. And is in the process of
> >> commiting
> >> 2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner -
> >> attempt_local_0001_m_000003_0 Progress/ping thread exiting since it
> >> got interrupted
> >> 2009-11-02 13:34:23,395 INFO  mapred.LocalJobRunner -
> >> file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432
> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group
> >> org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ
> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE
> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group
> >> org.apache.hadoop.mapred.Task$Counter with bundle
> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding
> >> COMBINE_OUTPUT_RECORDS
> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
> >> 2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding
> >> COMBINE_INPUT_RECORDS
> >> 2009-11-02 13:34:23,397 INFO  mapred.TaskRunner - Task
> >> 'attempt_local_0001_m_000003_0' done.
> >> 2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0   0:0
> >> 2009-11-02 13:34:23,397 DEBUG conf.Configuration -
> >> java.io.IOException: config(config)
> >>        at
> >> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192)
> >>        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139)
> >>        at
> >> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
> >>
> >> 2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to
> >> /tmp/hadoop-root/mapred/local/localRunner/split.dta
> >> 2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner -
> >> attempt_local_0001_m_000004_0 Progress/ping thread started
> >> 2009-11-02 13:34:23,452 INFO  mapred.MapTask - numReduceTasks: 1
> >> 2009-11-02 13:34:23,453 INFO  mapred.MapTask - io.sort.mb = 100
> >> 2009-11-02 13:34:23,644 INFO  mapred.JobClient -  map 100% reduce 0%
> >>
> >> Mathan
> >> On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <a...@getopt.org> wrote:
> >> > Kalaimathan Mahenthiran wrote:
> >> >>
> >> >> I forgot to add one detail:
> >> >>
> >> >> The segment I'm trying to run updatedb on has 1.3 million URLs fetched
> >> >> and 1.08 million URLs parsed.
> >> >>
> >> >> Any help related to this would be appreciated.
> >> >>
> >> >>
> >> >> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran
> >> >> <matha...@gmail.com> wrote:
> >> >>>
> >> >>> hi everyone
> >> >>>
> >> >>> I'm using Nutch 1.0. I have fetched successfully and am currently on
> >> >>> the updatedb step. The updatedb is taking very long and I don't know
> >> >>> why. I have a new machine with a quad-core processor and 8 GB of RAM.
> >> >>>
> >> >>> I believe this system is really good in terms of processing power, so
> >> >>> I don't think processing power is the problem here. I noticed that
> >> >>> almost all the RAM - close to 7.7 GB - is being used up by the
> >> >>> updatedb process, and the computer is becoming really slow.
> >> >>>
> >> >>> The updatedb process has been running for the last 19 days continually
> >> >>> with the message "merging segment data into db". Does anyone know why
> >> >>> it's taking so long? Is there any configuration setting I can change
> >> >>> to increase the speed of the updatedb process?
> >> >
> >> > First, this process normally takes just a few minutes, depending on the
> >> > hardware, and not several days - so something is wrong.
> >> >
> >> > * do you run this in "local" or pseudo-distributed mode (i.e. running a
> >> > real jobtracker and tasktracker)? Try the pseudo-distributed mode,
> >> > because then you can monitor the progress in the web UI.
> >> >
> >> > * how many reduce tasks do you have? with large updates it helps if you
> >> > run more than 1 reducer, to split the final sorting.
> >> >
> >> > * if the task appears to be completely stuck, please generate a thread
> >> > dump (kill -SIGQUIT) and see where it's stuck. This could be related to
> >> > urlfilter-regex or urlnormalizer-regex - you can identify if these are
> >> > problematic by removing them from the config and re-running the
> >> > operation.
> >> >
> >> > * minor issue - when specifying the path names of segments and crawldb,
> >> > do NOT append the trailing slash - it's not harmful in this particular
> >> > case, but you could have a nasty surprise when doing e.g. copy / mv
> >> > operations ...
> >> >
> >> > --
> >> > Best regards,
> >> > Andrzej Bialecki     <><
> >> >  ___. ___ ___ ___ _ _   __________________________________
> >> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> > http://www.sigram.com  Contact: info at sigram dot com
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
> >
>
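
P.S. To illustrate Andrzej's point about running more than one reducer: in
(pseudo-)distributed mode you can raise the reduce count in hadoop-site.xml.
A sketch only - the value of 4 is an example, tune it to your cores:

```xml
<!-- hadoop-site.xml: more reducers split the final sort of the CrawlDb -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
```

Note that this has no effect under the LocalJobRunner, which always uses a
single reducer.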



-- 
DigitalPebble Ltd
http://www.digitalpebble.com
