Hi all,

I've been grubbing around with Nutch for a while now, although I'm still working with 0.7 code. I notice that when anchors are collected for a document, they're made unique by domain and by anchor text. I'm using Nutch for an "intranet style" search engine, on a single site, so I don't really care about the uniqueness by domain. However, I can't help thinking that the uniqueness by anchor text probably isn't what I want.

Suppose my site has 3 pages with links to page X, and the same anchor text. I'd kind of like to score page X higher than a page where there's only one incoming link with that anchor text. But I don't want to have this effect swamping the other calculations of page score. In other words, if my site has 1000 pages with links to page X, this page should score a wee bit higher than a similar page with just one incoming link, but not 1000 times higher.

I'm thinking of doing some maths with the number of repetitions of an anchor, then including the result in the page score. Something like log(10+n), or maybe n/(n+2), where n is the number of incoming links with the same anchor text. Either of these formulas would make 1000 incoming links score roughly 3 times higher than a single incoming link, which seems about right to me.

It looks to me like I'm going to have to make changes deep within the Lucene page scoring stuff to do this, which I'm not really looking forward to. I'd really welcome hearing if anybody has a better solution to this general problem.

The exact maths isn't too critical. What's important is that for small values of n, the page score must increase as n increases, but the overall effect must diminish as n gets really large.

Thanks in advance,
David.
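
P.S. To make the arithmetic concrete, here is a rough Python sketch of the two candidate formulas, log(10+n) and n/(n+2). It is only a check of the numbers, not Nutch or Lucene code: the function names (log_boost, ratio_boost, relative_boost) are made up for illustration, and nothing here says where such a factor would plug into the scoring.

```python
import math


def log_boost(n: int) -> float:
    """Candidate 1: log(10 + n), where n is the number of incoming
    links that share the same anchor text. Base 10 is used here, but
    the ratio between two values of n is the same in any base."""
    return math.log10(10 + n)


def ratio_boost(n: int) -> float:
    """Candidate 2: n / (n + 2). Approaches 1 as n grows."""
    return n / (n + 2)


def relative_boost(boost, n: int) -> float:
    """How many times higher n repeated anchors score than a single one."""
    return boost(n) / boost(1)


if __name__ == "__main__":
    # Compare both candidates against the single-link baseline.
    for n in (1, 3, 10, 100, 1000):
        print(
            n,
            round(relative_boost(log_boost, n), 2),
            round(relative_boost(ratio_boost, n), 2),
        )
```

For n = 1000 this gives a relative boost of about 2.9 for the log form and about 3.0 for the ratio form. One difference between the two: n/(n+2) can never exceed 3 times the single-link value however large n gets, while log(10+n) keeps growing, just very slowly.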
