Hi,
did you try...
<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

... setting it to false?

Stefan

On 20.12.2005 at 00:49, David Wallace wrote:

Hi all,
I've been grubbing around with Nutch for a while now, although I'm
still working with 0.7 code.  I notice that when anchors are collected
for a document, they're made unique by domain and by anchor text.

I'm using Nutch for an "intranet style" search engine, on a single
site, so I don't really care about the uniqueness by domain. However, I
can't help thinking that the uniqueness by anchor text probably isn't
what I want.

Suppose my site has 3 pages with links to page X, and the same anchor
text. I'd kind of like to score page X higher than a page where there's
only one incoming link with that anchor text. But I don't want to have
this effect swamping the other calculations of page score.  In other
words, if my site has 1000 pages with links to page X, this page should
score a wee bit higher than a similar page with just one incoming link,
but not 1000 times higher.

I'm thinking of doing some maths with the number of repetitions of an
anchor, then including the result in the page score.  Something like
log(10+n), or maybe n/(n+2); where n is the number of incoming links
with the same anchor text.  Either of these formulas would make 1000
incoming links score roughly 3 times higher than a single incoming link,
which seems about right to me.
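
As a quick check of the two candidate formulas, here is a minimal sketch
in Python. It is not Nutch or Lucene code; the function names
(log_damping, ratio_damping, relative_boost) are made up for
illustration, and base 10 is an assumption for the logarithm (the ratio
between two values of n does not depend on the base).

```python
import math

def log_damping(n):
    # Candidate 1: log(10 + n). Base 10 is assumed here; the ratio
    # between two n values is the same for any logarithm base.
    return math.log10(10 + n)

def ratio_damping(n):
    # Candidate 2: n / (n + 2). Approaches 1 as n grows.
    return n / (n + 2)

def relative_boost(f, n):
    # How much higher n repeated anchors score than a single one.
    return f(n) / f(1)

if __name__ == "__main__":
    for n in (1, 3, 10, 100, 1000):
        print(n,
              round(relative_boost(log_damping, n), 2),
              round(relative_boost(ratio_damping, n), 2))
```

For n = 1000 the relative boost comes out at about 2.88 for log(10+n)
and about 2.99 for n/(n+2). Both functions keep increasing with n, but
the second one flattens out much sooner.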

It looks to me like I'm going to have to make changes deep within the
Lucene page scoring stuff to do this, which I'm not really looking
forward to. I'd really welcome hearing if anybody has a better solution
to this general problem.  The exact maths isn't too critical.  What's
important is that for small values of n, the page score must increase as
n increases, but the overall effect must diminish as n gets really
large.

Thanks in advance,
David.


---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net

