Lukas Vlcek wrote:
Hi,
I am using Nutch 0.8-dev. I have a small shell script for the
generate/fetch/update cycle, and I used the generate command with -topN 500.
After crawling about 2000 pages I changed -topN to 3 (yes, three pages
only) to see which pages get crawled.
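The cycle itself is just the usual Nutch 0.8 commands, roughly the
following (the crawl/crawldb and crawl/segments paths are only
placeholders for my actual directories):

    #!/bin/sh
    # one generate/fetch/update round against an existing crawldb
    bin/nutch generate crawl/crawldb crawl/segments -topN 500
    # pick the newest segment (segment names are timestamps, so they sort)
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment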
I found that the generate/fetch/update cycles always crawl the same
three pages!
I would expect it to crawl different pages in every cycle (we have more
than 3 pages on our intranet, and I am sure I injected enough link food).
Can anybody tell me what I am doing wrong?
This indeed sounds strange; it looks like the information for these pages
is not being updated in the db. What is the fetch interval for these
pages? Could you run a readdb -dump before and after updatedb?
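For example (the dump output directories are arbitrary, and crawl/crawldb
stands for your actual crawldb path):

    bin/nutch readdb crawl/crawldb -dump dump_before
    bin/nutch updatedb crawl/crawldb crawl/segments/<your_segment>
    bin/nutch readdb crawl/crawldb -dump dump_after

Then compare the status, fetch time and fetch interval of those three
URLs in the two dumps.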
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com