Re: Missing pages & anchor text

Doug Cook Thu, 31 Aug 2006 08:04:13 -0700

I'm thinking I should file issues on the following-

1. The scoring bug. Not sure what to file here, since such things are hard
to pin down. But defining an "inversion" as
        score(hostname/(index|default|home).(html|jsp|asp|cfm|etc)) >
        score(hostname)
on a ~2.5Mdoc database, where I have about 8100 such pairs, 6558 were
inversions and only 1585 were "okay." Is this likely to be correct behavior
for OPIC scores? Is this a likely manifestation of a known bug? It doesn't
seem correct, but then, it's early and I still need more coffee ;-) In any
case, this causes the "wrong" versions of the pages to be selected most of
the time during dedup, and I've lost >6500 of the most important, most
anchor-text-rich pages in my index -- a significant relevance issue.


2. When "duplicates" really refer to the same page (e.g. X/ vs.
X/index.html), entries should be merged. Really, these are just
after-the-fact normalizations, but they are a class of normalizations which
can't be done without comparing page fingerprints, since they are not true
for all web servers.

3. Redirects. The index keeps the redirect target, but marks the source as
unfetched. This is unfortunate behavior, at least for the class of redirects
where www.x.com redirects to www.x.com/y, which, like the above combination
of issues, causes the root pages, and thus much of the important anchor
text, to be dropped from the index. This seems related to, if not the same
as, NUTCH-273 (https://issues.apache.org/jira/browse/NUTCH-273). I was
simply planning to add these comments to that issue, unless someone hollers.
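
To make the "inversion" check in item 1 concrete, here is a rough Python
sketch of how such pairs could be counted. It is only an illustration, not
Nutch code: it assumes the scores have already been dumped into a plain
URL -> score mapping, and the names INDEX_PATH and count_inversions are
made up for this example. The extension list covers only the ones named
above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern for root-level "index-like" pages; only the
# extensions named in the message are listed here.
INDEX_PATH = re.compile(r"^/(index|default|home)\.(html?|jsp|asp|cfm)$",
                        re.IGNORECASE)

def count_inversions(scores):
    """Count (inversions, okay) pairs in a dict mapping URL -> score.

    A pair is a root URL plus a root-level index-like URL on the same
    host. It is an "inversion" when the index-like URL scores strictly
    higher than the root URL.
    """
    inversions = okay = 0
    for url, score in scores.items():
        parts = urlsplit(url)
        if not INDEX_PATH.match(parts.path):
            continue  # not an index-like page
        root = f"{parts.scheme}://{parts.netloc}/"
        if root not in scores:
            continue  # no root entry to compare against
        if score > scores[root]:
            inversions += 1
        else:
            okay += 1
    return inversions, okay
```

Pairs whose root URL is missing from the mapping are skipped rather than
counted, so the totals describe only hosts where both versions were scored.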

Any comments or thoughts before I file the above issues?

For all of the cases where we ignore/drop pages, we should think about what
happens to the inbound anchor text. We should work very, very hard to keep
all the anchor text we have; it's by far the most important page feature for
relevance.
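
As a sketch of what "merging" in item 2 could look like, the Python below
combines two entries that are known to be the same page so that no anchor
text is dropped. This is a hypothetical illustration, not how Nutch's dedup
works today: the Entry type and merge_same_page are invented names, summing
the scores is just one possible way to combine them, and the number of
anchors stands in for the inbound link count when picking which URL to
keep.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """Hypothetical index entry; not a Nutch class."""
    url: str
    fingerprint: str          # content hash used to detect duplicates
    score: float
    anchors: list = field(default_factory=list)  # inbound anchor texts

def merge_same_page(a, b):
    """Merge two entries that refer to the same page (e.g. X/ and
    X/index.html), keeping every anchor text from both."""
    if a.fingerprint != b.fingerprint:
        raise ValueError("fingerprints differ; not the same page")
    # Keep the URL with more inbound anchors (assumption: a proxy for
    # inlink count); on a tie, the first argument wins.
    keep, drop = (a, b) if len(a.anchors) >= len(b.anchors) else (b, a)
    return Entry(
        url=keep.url,
        fingerprint=keep.fingerprint,
        score=a.score + b.score,          # assumption: combine by summing
        anchors=keep.anchors + drop.anchors,
    )
```

For the x.com example quoted below, merging the root entry (9 inlinks) with
the index.html entry (none) would keep http://www.x.com/ along with all of
its anchor text, regardless of which entry had the higher score.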

-doug


Doug Cook wrote:
> 
> Hi Stefan,
> 
> Yes, you're right. The index built without deduping does not have the
> first instance of the problem (though of course, it's also filled with
> duplicates, so it has other problems). It still shows the problems with
> missing redirects, though this could be something else (will investigate
> that next). 
> 
> A little digging has turned up more information:
> 
> 1) Dedup throws away content matches, and decides which one to pick based
> upon score. This leads it to dump the wrong page, because:
> 
> http://www.x.com/
>     score: 1.2
> http://www.x.com/index.html
>     score: 1.8
> 
> I see two problems.
> 
> First, there is clearly a scoring problem (possibly my fault somehow;
> could this have resulted from my failing to build the index properly?).
> The root page actually has 9 inlinks; the index.html page has none. I
> can't see anything that would warrant the index.html getting a higher
> score, even were these actually different pages. Seems like this could be
> related to the problems you've already discovered. One (perhaps just short
> term?) possibility would be to use the inbound linkcount for deciding
> which page becomes the "canonical" version of a duplicate set, since this
> is probably more stable than the scores.
> 
> Second, these are in fact the same page. Regardless of which page "wins"
> by score, dedup should actually merge the two entries since this is a safe
> normalization, given that we know the content fingerprints are the same.
> The anchor texts and the scores should be combined. We can't necessarily
> do this for the general dedup case -- a page shouldn't necessarily benefit
> just because there are multiple copies of it -- though even there we may
> be able to combine some anchor text. But in this case these are not
> multiple copies; they are the same page.
> 
> In any case, we should work hard not to lose anchor text unless it is
> completely justified (e.g. for spam). For relevance purposes, anchor text
> is more important than any other page feature, score included. And
> especially in our world of small, focused crawls, it is a precious, scarce
> resource.
> 
> Thoughts? Comments?
> 
> -Doug
> 
> 
> Stefan Groschupf-2 wrote:
>> 
>> Hi Doug,
>> I'm pretty sure that your problem is related to the deduping of your  
>> index.
>> In general the hash of the content of a page is used as the key for the
>> dedup tool.
>> We also ran into the forwarding problem in another case:
>> https://issues.apache.org/jira/browse/NUTCH-353
>> So maybe we should think about a general solution to the forwarding
>> problem.
>> 
>> Greetings,
>> Stefan
>> 
>> 
>> Am 28.08.2006 um 11:33 schrieb Doug Cook:
>> 
>>>
>>> Hi, folks,
>>>
>>> I have just started digging into relevance issues with Nutch, and I'm
>>> running into some mysteries. Before I dig too deep, I wanted to  
>>> check to see
>>> if these were known issues (a quick search of the email archives  
>>> and of JIRA
>>> didn't turn up anything). I'm running 0.8 with a handful of patches.
>>>
>>> I'm frequently finding root pages of sites missing from my index,  
>>> despite
>>> the fact that they have been fetched. In my admittedly short  
>>> investigation I
>>> have found two classes of cases:
>>>
>>> 1. Root URL is not a redirect, but there is a root-level index.html  
>>> page.
>>> The index.html page is in the index, but the root page is not.
>>> Unfortunately, most of the anchor text points to the root page, not  
>>> the
>>> /index.html page, and the anchor text has gone "missing" along with  
>>> its
>>> associated page, so relevance is poor.
>>>
>>> 2. Root URL is a redirect to another page. Again, this other page  
>>> is in the
>>> index, but the root page, along with its anchor text, has gone
>>> "missing."
>>>
>>> I have a deduped index. Both of these cases could result from dedup  
>>> throwing
>>> out the wrong URL, i.e. the one with more anchor text, although one  
>>> might
>>> expect dedup to merge the two anchor texts (at least in the case of  
>>> pages
>>> which commonly normalize to the same URL, e.g. / and /index.html).
>>>
>>> The second case might result from the root URL somehow being  
>>> normalized to
>>> its redirect target, but in that case (incorrect, in any case) I would
>>> expect the anchor text to also be attached to the redirect target,  
>>> and it is
>>> not.
>>>
>>> I'm about to rebuild with no deduping and see what I find.
>>>
>>> Thanks for your help & comments-
>>>
>>> Doug
>>> -- 
>>> View this message in context:
>>> http://www.nabble.com/Missing-pages---anchor-text-tf2179049.html#a6025652
>>> Sent from the Nutch - Dev forum at Nabble.com.
>>>
>>>
>> 
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> 101tec Inc.
>> Menlo Park, California
>> http://www.101tec.com
>> 
>> 
>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Missing-pages---anchor-text-tf2179049.html#a6081250
Sent from the Nutch - Dev forum at Nabble.com.
