Annona Keene wrote:
I am experiencing a very strange phenomenon.

I have a small site -- only 7 pages -- and I am giving a single page
in the seed list (the front page of the site). With a depth of 10
(which is just my default, it's overkill here), I get all of the
pages except the front page. I can see in the logs that the index
page is being fetched, and there are no errors. But it's never passed
on to the indexer. I've got debug turned on in the logs, and I can
find nothing unusual, except this page is just vanishing.

If I give the list of 7 pages explicitly and crawl with a depth of 1,
I get all of the pages including the index page. If I give the list
of 7 pages explicitly and crawl with a depth of 2, I don't get the
index page but I do get all the rest.

What on earth is going on here? I do not have topN set or any other
strange settings. Obviously I could just provide the URL list and
crawl at a depth of 1, but I really don't want to do that. I can't be
certain I'll know if new pages are added, and I don't want to miss
them just because they aren't in my seed list.

Has anyone ever seen something like this before? I'm eager for any
help someone might be able to offer.

I suspect the index page returns a redirect. If it is a so-called temporary redirect, Nutch will ignore this page. Could you dump the contents of the crawldb (readdb db -dump) and check the status of this URL? This assumes that you ran updatedb after crawling.
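
Independently of the crawldb dump, the redirect theory can be checked from outside Nutch by requesting the front page without following redirects and looking at the status code. The following is only a rough sketch in Python using the standard library. The helper names fetch_status and describe_status are made up for illustration, and the grouping of status codes into "temporary" and "permanent" follows the usual HTTP meanings, not necessarily the way Nutch classifies them internally.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10.0):
    """Request url once, without following redirects.

    Returns (status_code, location_header_or_None).
    Hypothetical helper for illustration; not part of Nutch.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # http.client never follows redirects on its own, so the
        # status seen here is the one the server sent for this URL.
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe_status(status):
    """Classify a status code by its usual HTTP meaning (an assumption,
    not Nutch's internal logic)."""
    if status in (301, 308):
        return "permanent redirect"
    if status in (302, 303, 307):
        return "temporary redirect"
    return "not a redirect"
```

Calling fetch_status with the seed URL and passing the returned status to describe_status shows whether the front page answers directly or points elsewhere. If it reports a temporary redirect, that would be consistent with the page being fetched but never reaching the indexer.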


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
