Thank you, Stefan, for your speedy response.
 
I have indeed changed that setting to false.  However, that doesn't
solve my problem.  The offending method is getAnchors in
org.apache.nutch.db.WebDBAnchors, which is called from
org.apache.nutch.tools.FetchListTool.  This method makes the array of
anchors for the FetchListEntry unique (unless, of course, the incoming
links are from different domains), and does so regardless of any
NutchConf setting.
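
To make the behaviour concrete, here is a minimal sketch of that kind of
de-duplication.  It is not the actual WebDBAnchors code: the class name,
the InLink type and the uniqueAnchors method are all hypothetical, and the
sketch only assumes that uniqueness is decided on the pair of source
domain and anchor text.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is NOT Nutch's WebDBAnchors.
class AnchorDedupSketch {

    // Hypothetical stand-in for an incoming link: source domain plus anchor text.
    static final class InLink {
        final String domain;
        final String anchorText;

        InLink(String domain, String anchorText) {
            this.domain = domain;
            this.anchorText = anchorText;
        }
    }

    // Keeps the first anchor seen for each (domain, anchor text) pair,
    // so repeated anchors survive only when they come from different domains.
    static List<String> uniqueAnchors(List<InLink> links) {
        Set<String> seen = new HashSet<>();
        List<String> anchors = new ArrayList<>();
        for (InLink link : links) {
            // A newline separates the two parts of the composite key.
            String key = link.domain + "\n" + link.anchorText;
            if (seen.add(key)) {
                anchors.add(link.anchorText);
            }
        }
        return anchors;
    }

    public static void main(String[] args) {
        List<InLink> links = new ArrayList<>();
        links.add(new InLink("example.org", "Home"));
        links.add(new InLink("example.org", "Home")); // same domain and text: dropped
        links.add(new InLink("example.net", "Home")); // different domain: kept
        System.out.println(uniqueAnchors(links));
    }
}
```

With input like this, a hundred same-domain links carrying the same anchor
text contribute a single anchor, whatever the configuration says.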
 
If I changed the WebDBAnchors class to disable this uniqueness, I'd
then need to incorporate some kind of numerical fudging into the
scoring.  This is to prevent the scores being badly skewed in cases
where a page has a large number of incoming links, all with the same
anchor text.  This is likely to occur for pages that are linked from my
site's navigation chrome, for example.
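
One possible shape for that fudging, purely as a sketch: count how often
each anchor text occurs and let it contribute 1 + ln(count) rather than
count.  The class and method names below are hypothetical, the damping
formula is just one assumption among many that would work, and nothing
here is taken from Nutch or Lucene.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch or Lucene.
class AnchorDampingSketch {

    // Assumed damping: a single occurrence weighs 1.0, and each further
    // repeat adds less and less (logarithmic growth).
    static double dampedWeight(int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be at least 1");
        }
        return 1.0 + Math.log(count);
    }

    // Maps each distinct anchor text to its damped weight,
    // preserving the order in which the texts first appear.
    static Map<String, Double> weigh(List<String> anchors) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String anchor : anchors) {
            counts.merge(anchor, 1, Integer::sum);
        }
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            weights.put(entry.getKey(), dampedWeight(entry.getValue()));
        }
        return weights;
    }
}
```

Under this assumption, a navigation link repeated on 1000 pages would
weigh a little under 8 rather than 1000, so it still counts for more than
a one-off link without swamping everything else.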
 
I suspect I shall have to bite the bullet and start studying Lucene's
internal mathematics.
 
Regards,
David.
 
Stefan Groschupf wrote:

Hi,
did you try...
<property>
   <name>db.ignore.internal.links</name>
   <value>true</value>
   <description>If true, when adding new links to a page, links from
   the same host are ignored.  This is an effective way to limit the
   size of the link database, keeping only the highest quality
   links.
   </description>
</property>

... setting to false?

Stefan

On 20.12.2005 at 00:49, David Wallace wrote:

> Hi all,
> I've been grubbing around with Nutch for a while now, although I'm
> still working with 0.7 code.  I notice that when anchors are collected
> for a document, they're made unique by domain and by anchor text.

[ some snipped ]

