Re: description of db.ignore.internal.links property

Dennis Kubes Wed, 02 Apr 2008 07:41:10 -0700


Vineet Garg wrote:

Hi,

What does the db.ignore.internal.links property in nutch-default.xml do?

<property>
 <name>db.ignore.internal.links</name>
 <value>true</value>
 <description>If true, when adding new links to a page, links from
 the same host are ignored.  This is an effective way to limit the
 size of the link database, keeping only the highest quality
 links.
 </description>
</property>

If true, it will NOT store links from a domain that point to the same domain. For example, a link on the page www.domain.com/a.html that points to www.domain.com/b.html would be ignored. This significantly decreases the number of links being stored in the link database.

1. Does it affect the page rank by taking more pages into account when it creates the page rank, or

Yes, because by default internal links are scored the same as external links. For large web crawls this will throw off results because pages with more internal links can get higher rankings. I have found that on larger web crawls it is best to ignore internal links and to set db.score.link.internal to a very low value or 0.
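
A minimal sketch of that setup, assuming the overrides go in conf/nutch-site.xml (the usual place for local overrides of nutch-default.xml). The value 0.0 is just one choice of "very low value or 0"; adjust it for your crawl:

```xml
<configuration>
  <!-- Drop same-host links when building the link database. -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>true</value>
  </property>
  <!-- Assumed value: give internal links no weight in scoring. -->
  <property>
    <name>db.score.link.internal</name>
    <value>0.0</value>
  </property>
</configuration>
```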

2. does it affect indexing by indexing more pages, and therefore return more results when searching later on?

No, it doesn't affect the links being stored in crawldb and later fetched. It only affects linkdb and the eventual scoring process.


Dennis



Can anybody please explain it?


Regards,
Vineet Garg
