Thanks for all the replies.

Okay, it does look like something is wrong here.

I'm running Nutch out of the box, release 1.0, in "local" mode.
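
Per Andrzej's first suggestion below, I may switch to pseudo-distributed
mode so I can watch progress in the web UI. If I understand the Hadoop
quickstart right, the minimal change is pointing conf/hadoop-site.xml at
local daemons; localhost:9000/9001 are the conventional ports, not values
I have verified:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>

Starting the daemons presumably needs a standalone Hadoop 0.19 install,
since I don't believe the Nutch tarball ships the daemon scripts; after
bin/start-all.sh the jobtracker UI should be at http://localhost:50030/.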

The number of reduce tasks is whatever Nutch configures by default.
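
If I read the docs right, the knob for this is mapred.reduce.tasks in
conf/hadoop-site.xml, and the local runner collapses everything to a
single reduce anyway, so it should only take effect under a real
jobtracker. The value 4 below is just my guess for a quad-core box:

  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>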

The db size is approximately 860 MB.

I know the JVM itself is not dead, because I turned on the Hadoop logs and
can see entries still being written to them. What I don't know is how to
check whether the task is completely stuck.
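
The only check I know of is the thread dump Andrzej describes below. If I
have it right, it goes something like this, with the dump landing on the
JVM's stdout (or wherever that is redirected):

  jps -l            # find the PID of the java process running the job
  kill -QUIT <pid>  # SIGQUIT makes the JVM print a full thread dump
  # take two or three dumps a few seconds apart: if the map/reduce
  # threads show identical stacks every time, the task really is stuck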

Below is a sample of the log from just before I sent this mail. The job has
been on the updatedb step for the last 19 days and has been generating debug
output like this the whole time.

Has anyone else run into this issue before?
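
In case it turns out to be the regex plugins (see Andrzej's suggestion
below), the test I plan to run is removing urlfilter-regex and the regex
normalizer from plugin.includes in conf/nutch-site.xml and re-running
updatedb. A sketch only; the <value> is my reconstruction of the 1.0
default minus those plugins, so double-check it against nutch-default.xml:

  <property>
    <name>plugin.includes</name>
    <!-- default list with urlfilter-regex removed and
         urlnormalizer-(pass|regex|basic) cut down to urlnormalizer-pass;
         verify against conf/nutch-default.xml before relying on it -->
    <value>protocol-http|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-pass</value>
  </property>

Here is the log sample: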


2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group
org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_READ
2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding LOCAL_WRITE
2009-11-02 13:34:21,112 DEBUG mapred.Counters - Creating group
org.apache.hadoop.mapred.Task$Counter with bundle
2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
2009-11-02 13:34:21,112 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
2009-11-02 13:34:21,113 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
2009-11-02 13:34:21,643 INFO  mapred.JobClient -  map 93% reduce 0%
2009-11-02 13:34:22,121 INFO  mapred.MapTask - Spilling map output:
record full = true
2009-11-02 13:34:22,121 INFO  mapred.MapTask - bufstart = 10420198;
bufend = 13893589; bufvoid = 99614720
2009-11-02 13:34:22,121 INFO  mapred.MapTask - kvstart = 131070; kvend
= 65533; length = 327680
2009-11-02 13:34:22,427 INFO  mapred.MapTask - Finished spill 3
2009-11-02 13:34:23,301 INFO  mapred.MapTask - Starting flush of map output
2009-11-02 13:34:23,384 INFO  mapred.MapTask - Finished spill 4
2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =0(0,224, 228)
2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =1(0,242, 246)
2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =2(0,242, 246)
2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =3(0,242, 246)
2009-11-02 13:34:23,390 DEBUG mapred.MapTask -
MapId=attempt_local_0001_m_000003_0 Reducer=0Spill =4(0,242, 246)
2009-11-02 13:34:23,390 INFO  mapred.Merger - Merging 5 sorted segments
2009-11-02 13:34:23,392 INFO  mapred.Merger - Down to the last
merge-pass, with 5 segments left of total size: 1192 bytes
2009-11-02 13:34:23,393 INFO  mapred.MapTask - Index: (0, 354, 358)
2009-11-02 13:34:23,394 INFO  mapred.TaskRunner -
Task:attempt_local_0001_m_000003_0 is done. And is in the process of
commiting
2009-11-02 13:34:23,395 DEBUG mapred.TaskRunner -
attempt_local_0001_m_000003_0 Progress/ping thread exiting since it
got interrupted
2009-11-02 13:34:23,395 INFO  mapred.LocalJobRunner -
file:/opt/tsweb/nutch-1.0/newHyperseekCrawl/db/current/part-00000/data:100663296+33554432
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group
org.apache.hadoop.mapred.Task$FileSystemCounter with bundle
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_READ
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding LOCAL_WRITE
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Creating group
org.apache.hadoop.mapred.Task$Counter with bundle
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
2009-11-02 13:34:23,396 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
2009-11-02 13:34:23,397 INFO  mapred.TaskRunner - Task
'attempt_local_0001_m_000003_0' done.
2009-11-02 13:34:23,397 DEBUG mapred.SortedRanges - currentIndex 0   0:0
2009-11-02 13:34:23,397 DEBUG conf.Configuration -
java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:192)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:139)
        at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)

2009-11-02 13:34:23,398 DEBUG mapred.MapTask - Writing local split to
/tmp/hadoop-root/mapred/local/localRunner/split.dta
2009-11-02 13:34:23,451 DEBUG mapred.TaskRunner -
attempt_local_0001_m_000004_0 Progress/ping thread started
2009-11-02 13:34:23,452 INFO  mapred.MapTask - numReduceTasks: 1
2009-11-02 13:34:23,453 INFO  mapred.MapTask - io.sort.mb = 100
2009-11-02 13:34:23,644 INFO  mapred.JobClient -  map 100% reduce 0%

Mathan
On Mon, Nov 2, 2009 at 4:11 AM, Andrzej Bialecki <a...@getopt.org> wrote:
> Kalaimathan Mahenthiran wrote:
>>
>> I forgot to add a detail:
>>
>> The segment I'm trying to run updatedb on has 1.3 million URLs fetched
>> and 1.08 million URLs parsed.
>>
>> Any help with this would be appreciated.
>>
>>
>> On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran
>> <matha...@gmail.com> wrote:
>>>
>>> Hi everyone,
>>>
>>> I'm using Nutch 1.0. The fetch completed successfully and I'm now on
>>> the updatedb step, which is taking very long, and I don't know why.
>>> The machine is new, with a quad-core processor and 8 GB of RAM.
>>>
>>> I believe the system is more than adequate in processing power, so I
>>> don't think that is the bottleneck. I did notice that the updatedb
>>> process is using nearly all the RAM, close to 7.7 GB, and the machine
>>> has become very slow.
>>>
>>> The updatedb process has been running continually for the last 19 days
>>> with the message "merging segment data into db". Does anyone know why
>>> it is taking so long? Is there a configuration setting I can change to
>>> speed it up?
>
> First, this process normally takes just a few minutes, depending on the
> hardware, and not several days - so something is wrong.
>
> * do you run this in "local" or pseudo-distributed mode (i.e., with a real
> jobtracker and tasktracker)? Try the pseudo-distributed mode, because then
> you can monitor the progress in the web UI.
>
> * how many reduce tasks do you have? With large updates it helps if you
> run more than one reducer, to split the final sorting.
>
> * if the task appears to be completely stuck, please generate a thread dump
> (kill -SIGQUIT) and see where it's stuck. This could be related to
> urlfilter-regex or urlnormalizer-regex - you can identify if these are
> problematic by removing them from the config and re-running the operation.
>
> * minor issue - when specifying the path names of segments and crawldb, do
> NOT append the trailing slash - it's not harmful in this particular case,
> but you could have a nasty surprise when doing e.g. copy / mv operations ...
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
