Thank you, Stefan, for your speedy response. I have indeed changed that setting to false. However, that doesn't solve my problem.

The offending method is getAnchors in org.apache.nutch.db.WebDBAnchors, which is called from org.apache.nutch.tools.FetchListTool. This method makes the array of anchors unique for the FetchListEntry (unless, of course, the incoming links are from different domains), and does so regardless of any NutchConf setting.

If I changed the WebDBAnchors class to disable this uniqueness, I'd then need to incorporate some kind of numerical fudging into the scoring. This is to prevent the scores being badly skewed in cases where a page has a large number of incoming links, all with the same anchor text. This is likely to occur for pages that are linked from my site's navigation chrome, for example.

I suspect I shall have to bite the bullet and start studying Lucene's internal mathematics.

Regards,
David.

Stefan Groschupf wrote:
Hi,

did you try setting

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality links.
  </description>
</property>

to false?

Stefan

On 20.12.2005 at 00:49, David Wallace wrote:

> Hi all,
> I've been grubbing around with Nutch for a while now, although I'm
> still working with 0.7 code. I notice that when anchors are collected
> for a document, they're made unique by domain and by anchor text.

[ some snipped ]
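
For illustration, the uniqueness rule and the kind of "numerical fudging" discussed in the reply above might be sketched as follows. This is only a rough sketch under stated assumptions: AnchorSketch, InLink, uniqueAnchors and dampedWeight are hypothetical names, not part of Nutch or Lucene, and the 1 + ln(count) damping is just one possible choice, not what either library does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnchorSketch {

    /** Hypothetical stand-in for an incoming link; not a Nutch class. */
    public static final class InLink {
        final String fromDomain;
        final String anchorText;

        public InLink(String fromDomain, String anchorText) {
            this.fromDomain = fromDomain;
            this.anchorText = anchorText;
        }
    }

    /**
     * Keeps one anchor per (domain, anchor text) pair, in first-seen order.
     * The same anchor text arriving from a different domain is kept again.
     */
    public static List<String> uniqueAnchors(List<InLink> links) {
        Set<String> seen = new HashSet<>();
        List<String> anchors = new ArrayList<>();
        for (InLink link : links) {
            // A newline separates the two parts of the composite key.
            String key = link.fromDomain + "\n" + link.anchorText;
            if (seen.add(key)) {
                anchors.add(link.anchorText);
            }
        }
        return anchors;
    }

    /**
     * Assumed damping: an anchor repeated 'count' times contributes
     * 1 + ln(count) rather than 'count', so navigation links that repeat
     * on every page do not dominate the score.
     */
    public static double dampedWeight(int count) {
        if (count <= 0) {
            return 0.0;
        }
        return 1.0 + Math.log(count);
    }
}
```

With damping of this kind, a hundred identical anchors would count for fewer than six, so disabling the uniqueness check would not by itself swamp the other scoring factors.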
