Vineet Garg wrote:
Hi,

What does db.ignore.internal.links property in nutch-default.xml do?

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality links.
  </description>
</property>
If true, it will NOT store links whose source and target are on the same host. For example, a link on the page www.domain.com/a.html that points to www.domain.com/b.html is ignored. This significantly decreases the number of links stored in the link database.
1. Does it affect the page rank by taking more pages into account when the page rank is computed, or
Yes, because by default internal links are scored the same as external links. On large web crawls this can skew results, because pages with more internal links can get higher rankings. I have found that on larger web crawls it is best to ignore internal links and to set db.score.link.internal to a very low value or 0.
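A minimal sketch of what that could look like as overrides, assuming they are placed in conf/nutch-site.xml rather than edited into nutch-default.xml. The 0.0 value is only an illustration of "a very low value or 0"; tune it for your own crawl.

```xml
<!-- Sketch only: overrides assumed to live in conf/nutch-site.xml. -->
<configuration>
  <!-- Do not store same-host links in the linkdb. -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>true</value>
  </property>
  <!-- Illustrative value: give internal links no score contribution. -->
  <property>
    <name>db.score.link.internal</name>
    <value>0.0</value>
  </property>
</configuration>
```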
2. Does it affect indexing by indexing more pages, and therefore return more results when searching later on?
No, it doesn't affect the links stored in crawldb and later fetched. It only affects linkdb and the eventual scoring process.
Dennis
Can anybody please explain it? Regards, Vineet Garg
