Hi Doug,
I'm pretty sure that your problem is related to the deduping of your
index.
In general, the hash of the content of a page is used as the key for
the dedup tool.
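As a rough illustration of why keying on the content hash can drop the
"wrong" URL, here is a minimal sketch. It is not Nutch's actual dedup
code: the MD5 digest, the "keep the highest score" rule, the record
layout, and the example.com URLs are all assumptions made for the
example.

```python
import hashlib


def content_key(content: bytes) -> str:
    # Assumption: an MD5 digest of the raw page content is the dedup key.
    return hashlib.md5(content).hexdigest()


def dedup_by_content(pages):
    """Keep exactly one page per content hash.

    Hypothetical rule: the page with the highest score wins. Everything
    attached to the losing pages (including anchor text) is discarded.
    """
    kept = {}
    for page in pages:
        key = content_key(page["content"])
        best = kept.get(key)
        if best is None or page["score"] > best["score"]:
            kept[key] = page
    return list(kept.values())


# Two URLs serving identical content; only the root URL has anchor text.
pages = [
    {"url": "http://example.com/", "content": b"<html>home</html>",
     "score": 1.0, "anchors": ["Example home"]},
    {"url": "http://example.com/index.html", "content": b"<html>home</html>",
     "score": 1.5, "anchors": []},
]

survivors = dedup_by_content(pages)
for page in survivors:
    print(page["url"], page["anchors"])
```

In this sketch the /index.html record survives and the root URL's
anchor text disappears with it, which matches the symptom described
below.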
We also ran into the forwarding problem in another case:
https://issues.apache.org/jira/browse/NUTCH-353
So maybe we should think about a general solution to the forwarding
problem.
Greetings,
Stefan
On 28.08.2006 at 11:33, Doug Cook wrote:
Hi, folks,
I have just started digging into relevance issues with Nutch, and I'm
running into some mysteries. Before I dig too deep, I wanted to check
whether these are known issues (a quick search of the email archives
and of JIRA didn't turn up anything). I'm running 0.8 with a handful of
patches.
I'm frequently finding root pages of sites missing from my index,
despite the fact that they have been fetched. In my admittedly short
investigation I have found two classes of cases:
1. Root URL is not a redirect, but there is a root-level index.html
page. The index.html page is in the index, but the root page is not.
Unfortunately, most of the anchor text points to the root page, not the
/index.html page, and the anchor text has gone "missing" along with its
associated page, so relevance is poor.
2. Root URL is a redirect to another page. Again, this other page is in
the index, but the root page, along with its anchor text, has gone
"missing."
I have a deduped index. Both of these cases could result from dedup
throwing out the wrong URL, i.e. the one with more anchor text, although
one might expect dedup to merge the two anchor texts (at least in the
case of pages which commonly normalize to the same URL, e.g. / and
/index.html).
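To make the "merge the two anchor texts" idea concrete, here is a
minimal sketch of a dedup pass that keeps one URL per content hash but
carries over the anchors of the discarded duplicates. This is an
assumption about how such a merge could look, not a description of what
Nutch does; the record layout, the MD5 key, and the "most anchors wins"
rule are hypothetical.

```python
import hashlib
from collections import defaultdict


def dedup_merging_anchors(pages):
    """Group pages by content hash, keep one URL per group, merge anchors.

    Hypothetical rule: the URL with the most anchor text survives, and it
    inherits the anchors of every duplicate that is thrown out.
    """
    groups = defaultdict(list)
    for page in pages:
        key = hashlib.md5(page["content"]).hexdigest()
        groups[key].append(page)

    merged = []
    for group in groups.values():
        # Copy the winner so the input records are left untouched.
        winner = dict(max(group, key=lambda p: len(p["anchors"])))
        anchors = []
        for page in group:
            for anchor in page["anchors"]:
                if anchor not in anchors:  # drop repeats, keep order
                    anchors.append(anchor)
        winner["anchors"] = anchors
        merged.append(winner)
    return merged


pages = [
    {"url": "http://example.com/", "content": b"<html>home</html>",
     "anchors": ["Example", "Example home"]},
    {"url": "http://example.com/index.html", "content": b"<html>home</html>",
     "anchors": ["Example", "Index page"]},
]

result = dedup_merging_anchors(pages)
for page in result:
    print(page["url"], page["anchors"])
```

With this rule the root URL is kept and ends up with the anchors of
both pages, so no anchor text is lost to deduplication.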
The second case might result from the root URL somehow being normalized
to its redirect target, but in that case (which would be incorrect
anyway) I would expect the anchor text to also be attached to the
redirect target, and it is not.
I'm about to rebuild with no deduping and see what I find.
Thanks for your help & comments-
Doug
--
View this message in context:
http://www.nabble.com/Missing-pages---anchor-text-tf2179049.html#a6025652
Sent from the Nutch - Dev forum at Nabble.com.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
101tec Inc.
Menlo Park, California
http://www.101tec.com