Kalaimathan Mahenthiran wrote:
I forgot to add the detail...
The segment I'm trying to run updatedb on has 1.3 million URLs fetched
and 1.08 million URLs parsed.
Any help related to this would be appreciated...
On Sun, Nov 1, 2009 at 11:53 PM, Kalaimathan Mahenthiran
<matha...@gmail.com> wrote:
Hi everyone,
I'm using Nutch 1.0. The fetch completed successfully and I'm now on the
updatedb step. The updatedb is taking a very long time and I don't
know why. I have a new machine with a quad-core
processor and 8 GB of RAM.
I believe this system is really good in terms of processing power, so I
don't think processing power is the problem here. I noticed that almost all
the RAM is being used up (close to 7.7 GB) by the updatedb process.
The computer is becoming really slow.
The updatedb process has been running continuously for the last 19 days
with the message "merging segment data into db". Does anyone know why
it's taking so long? Is there any configuration setting I can change to
increase the speed of the updatedb process?
First, this process normally takes just a few minutes, depending on the
hardware, and not several days - so something is wrong.
* do you run this in "local" or pseudo-distributed mode (i.e. running a
real jobtracker and tasktracker)? Try the pseudo-distributed mode,
because then you can monitor the progress in the web UI.
* how many reduce tasks do you have? with large updates it helps if you
run > 1 reducer, to split the final sorting.
* if the task appears to be completely stuck, please generate a thread
dump (kill -SIGQUIT <pid> on the Java process, which prints the dump to
its stdout) and see where it's stuck. This could be related to
urlfilter-regex or urlnormalizer-regex - you can identify if these are
problematic by removing them from the config and re-running the operation.
* minor issue - when specifying the path names of segments and crawldb,
do NOT append the trailing slash - it's not harmful in this particular
case, but you could have a nasty surprise when doing e.g. copy / mv
operations ...
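
If the thread dump points at java.util.regex, a rough way to find the
offending rule is to time each pattern against a URL taken from the slow
segment. The following is only a minimal sketch, not Nutch code: the class
name RegexRuleTimer, the helpers parseRule and timeRule, the sample rules
and the sample URL are all made up for illustration, and it assumes each
rule line is a '+' or '-' sign followed by a Java regex that is applied
with find().

```java
import java.util.regex.Pattern;

public class RegexRuleTimer {

    // Strips the leading '+' or '-' sign from a rule line and returns the
    // bare regex; returns null for blank lines and '#' comments.
    // (Assumed rule-file layout; check it against your own config.)
    public static String parseRule(String line) {
        String t = line.trim();
        if (t.isEmpty() || t.startsWith("#")) {
            return null;
        }
        char sign = t.charAt(0);
        return (sign == '+' || sign == '-') ? t.substring(1) : t;
    }

    // Nanoseconds spent on a single find() of the regex against the URL.
    // Pattern compilation is deliberately left outside the timed region.
    public static long timeRule(String regex, String url) {
        Pattern p = Pattern.compile(regex);
        long start = System.nanoTime();
        p.matcher(url).find();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Illustrative rules and URL only; substitute the lines from your
        // own filter file and a long URL from the problematic segment.
        String[] lines = {
            "# skip some schemes",
            "-^(file|ftp|mailto):",
            "-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/",
            "+."
        };
        String url = "http://example.com/a/b/c/a/b/c/a/b/c/index.html";

        for (String line : lines) {
            String regex = parseRule(line);
            if (regex == null) {
                continue; // comment or blank line
            }
            System.out.printf("%10d ns  %s%n", timeRule(regex, url), regex);
        }
    }
}
```

A rule that takes orders of magnitude longer than the others, especially
on long URLs, is a good candidate to comment out before re-running
updatedb.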
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com