Lukas Vlcek wrote:
Hi,

I am using nutch0.8-dev. I have a small shell script for the
generate/fetch/update cycle. I used the generate command with -topN 500.
After crawling about 2000 pages I changed -topN to 3 (yes, three pages
only) to see which pages are crawled.
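A minimal sketch of such a cycle with the 0.8-dev command line (the
crawldb/segments paths and the way the newest segment is picked are
just assumptions, not necessarily the exact script):

  # One generate/fetch/update cycle; adjust CRAWLDB/SEGMENTS to your layout.
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments

  bin/nutch generate $CRAWLDB $SEGMENTS -topN 500
  SEGMENT=`ls -d $SEGMENTS/* | tail -1`   # newest (just generated) segment
  bin/nutch fetch $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT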

I found that the generate/fetch/update cycles always crawl the same
three pages!
I would expect them to crawl different pages in every cycle
(and we have more than 3 pages on the intranet, and I am sure I injected
enough link food).

Can anybody tell me what I am doing wrong?

This indeed sounds strange - it looks like their information is not being updated in the db. What was the fetch interval for these pages? Could you run a readdb -dump before and after updatedb?
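Something along these lines should work (assuming the crawldb lives under
crawl/crawldb; the dump output directories must not already exist):

  # Dump the crawldb to plain text before and after updatedb so the
  # entries can be compared; paths and dump directory names are examples.
  bin/nutch readdb crawl/crawldb -dump dump_before_updatedb
  bin/nutch updatedb crawl/crawldb <segment>
  bin/nutch readdb crawl/crawldb -dump dump_after_updatedb

The dumped entries should show, among other fields, the status and next
fetch time for each URL; if those values do not change after updatedb,
that would explain why generate keeps selecting the same pages.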

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com