Thanks for all the replies. Okay, I agree there does seem to be some issue, so here are the details.
I'm running Nutch release 1.0 out of the box, in "local" mode, with the
default number of reduce tasks configured by Nutch. The crawl db is
approximately 860 MB. I know the process is running, because I turned on
the Hadoop debug logs and can see new entries being written to them, but
I'm not sure how to check whether the task is actually making progress
or is completely stuck. It has been on the updatedb step for the last 19
days, generating debug logs like the sample below the whole time. Has
anyone else had this same issue before?

2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ
2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE
2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
2009-11-02 13:34:21,643 INFO mapred.JobClient - map 93% reduce 0%
2009-11-02 13:34:22,121 INFO mapred.MapTask - Spilling map output: record full = true
2009-11-02 13:34:22,121 INFO mapred.MapTask - bufstart = 10420198; bufend = 13893589; bufvoid = 99614720
2009-11-02 13:34:22,121 INFO mapred.MapTask - kvstart = 131070; kvend = 65533; length = 327680
2009-11-02 13:34:22,427 INFO mapred.MapTask - Finished spill 3
2009-11-02 13:34:23,301 INFO mapred.MapTask - Starting flush of map output
2009-11-02 13:34:23,384 INFO mapred.MapTask - Finished spill 4
2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228)
2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246)
2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246)
2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246)
2009-11-02 13:34:23,390 DEBUG mapred.MapTask - MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246)
2009-11-02 13:34:23,390 INFO mapred.Merger - Merging 5 sorted segments
2009-11-02 13:34:23,392 INFO mapred.Merger - Down to the last merge-pass, with 5 segments left of total size: 1192 bytes
2009-11-02 13:34:23,393 INFO mapred.MapTask - Index: (0, 354, 358)
2009-11-02 13:34:23,394 INFO mapred.TaskRunner - Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting
2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner - attempt_local_0001_m_000003_0 Progress/ping thread exiting since it got interrupted
2009-11-02 13:34:23,395 INFO mapred.LocalJobRunner - file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
2009-11-02 13:34:23,397 INFO mapred.TaskRunner - Task 'attempt_local_0001_m_000003_0' done.
2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0   0:0
2009-11-02 13:34:23,397 DEBUG conf.Configuration - java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to /tmp/hadoop-root/mapred/local/localRunner/split.dta
2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner - attempt_local_0001_m_000004_0 Progress/ping thread started
2009-11-02 13:34:23,452 INFO mapred.MapTask - numReduceTasks: 1
2009-11-02 13:34:23,453 INFO mapred.MapTask - io.sort.mb = 100
2009-11-02 13:34:23,644 INFO mapred.JobClient - map 100% reduce 0%

Mathan

On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <a...@getopt.org> wrote:
> Kalaimathan Mahenthiran wrote:
>>
>> I forgot to add this detail...
>>
>> The segment I'm trying to run updatedb on has 1.3 million URLs
>> fetched and 1.08 million URLs parsed.
>>
>> Any help related to this would be appreciated.
>>
>>
>> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran
>> <matha...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> I'm using Nutch 1.0. I have fetched successfully and am currently
>>> on the updatedb step. It is taking very long and I don't know why.
>>> I have a new machine with a quad-core processor and 8 GB of RAM.
>>>
>>> I believe this system is really good in terms of processing power,
>>> so I don't think processing power is the problem here. I noticed
>>> that almost all of the RAM - close to 7.7 GB - is being used by the
>>> updatedb process, and the computer has become really slow.
>>>
>>> The updatedb process has been running continually for the last 19
>>> days with the message "merging segment data into db". Does anyone
>>> know why it is taking so long? Is there any configuration setting I
>>> can change to increase the speed of the updatedb process?
>
> First, this process normally takes just a few minutes, depending on
> the hardware, and not several days - so something is wrong.
>
> * do you run this in "local" or pseudo-distributed mode (i.e. running
> a real jobtracker and tasktracker)?
> Try the pseudo-distributed mode, because then you can monitor the
> progress in the web UI.
>
> * how many reduce tasks do you have? with large updates it helps if
> you run > 1 reducer, to split the final sorting.
>
> * if the task appears to be completely stuck, please generate a
> thread dump (kill -SIGQUIT) and see where it's stuck. This could be
> related to urlfilter-regex or urlnormalizer-regex - you can identify
> if these are problematic by removing them from the config and
> re-running the operation.
>
> * minor issue - when specifying the path names of segments and
> crawldb, do NOT append the trailing slash - it's not harmful in this
> particular case, but you could have a nasty surprise when doing e.g.
> copy / mv operations ...
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
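
A note for anyone who finds this thread in the archives: the reducer
count Andrzej mentions is a plain Hadoop property. A minimal sketch,
assuming the stock conf/ layout of Nutch 1.0 - the value 4 is just an
example, and it is only honored outside local mode, since the local job
runner caps a job at one reducer (that's the "numReduceTasks: 1" in the
log above, and one more reason to try pseudo-distributed mode):

    <!-- add to conf/nutch-site.xml (or conf/hadoop-site.xml) -->
    <property>
      <name>mapred.reduce.tasks</name>
      <value>4</value>
      <description>Run 4 reducers; ignored by the local job runner.
      </description>
    </property>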
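
The thread-dump step, concretely: SIGQUIT does not kill a JVM, it just
makes it print the stacks of all live threads to its stdout, i.e. to
whatever console or file the nutch script's output goes to. The grep
pattern and <pid> below are placeholders to adapt:

    # find the JVM running the updatedb job (bin/nutch updatedb runs
    # the org.apache.nutch.crawl.CrawlDb class)
    ps aux | grep '[C]rawlDb'
    # ask that process for a thread dump without killing it
    kill -QUIT <pid>

If the same thread shows up in java.util.regex.Pattern frames in dump
after dump, that points straight at the urlfilter-regex /
urlnormalizer-regex theory.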
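
To test that theory without deleting anything, you can override
plugin.includes in conf/nutch-site.xml with the two regex plugins
removed. The value below is only an illustration - start from the
plugin.includes default in your own conf/nutch-default.xml and drop the
urlfilter-regex and urlnormalizer regex entries:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opf|urlnormalizer-(pass|basic)</value>
      <description>Default plugin list minus urlfilter-regex and
      urlnormalizer-regex, for troubleshooting only.</description>
    </property>

If updatedb then finishes in minutes, look at the patterns in
conf/regex-urlfilter.txt - a pattern that backtracks badly on a few
pathological URLs out of 1.3 million is a classic cause of exactly this
kind of multi-day "merging segment data into db" hang.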