As Erik hinted, Nutch uses this information. It keeps a database of the links that point to each page, indexes their anchor text along with the page, and boosts highly linked pages more than less-linked pages.
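A rough sketch of that idea in Python, for illustration only. This is not Nutch's code: the sample links, the function names (`build_link_db`, `inlink_boost`, `make_document`), and the logarithmic boost formula are all assumptions made up for the example. It groups incoming links by target page, attaches the anchor text to the page's document, and gives pages with more distinct inlinks a larger boost.

```python
import math
from collections import defaultdict

# Hypothetical crawl output: (source_url, target_url, anchor_text) triples.
LINKS = [
    ("http://a.example/", "http://www.slashdot.org/", "Slashdot"),
    ("http://b.example/", "http://www.slashdot.org/", "slashdot news"),
    ("http://c.example/", "http://www.slashdot.org/", "Slashdot"),
    ("http://a.example/", "http://www.slashdot.org/soem_dir/somepage.asp", "some page"),
]


def build_link_db(links):
    """Group incoming links by target page: target -> [(source, anchor), ...]."""
    db = defaultdict(list)
    for source, target, anchor in links:
        db[target].append((source, anchor))
    return db


def inlink_boost(inlinks):
    """Assumed formula: boost grows with the log of distinct linking pages."""
    distinct_sources = {source for source, _ in inlinks}
    return 1.0 + math.log(1 + len(distinct_sources))


def make_document(url, body, link_db):
    """Build a dict standing in for an indexable document with an anchors field."""
    inlinks = link_db.get(url, [])
    return {
        "url": url,
        "body": body,
        # Anchor text of incoming links is indexed alongside the page itself.
        "anchors": " ".join(anchor for _, anchor in inlinks),
        "boost": inlink_boost(inlinks),
    }


link_db = build_link_db(LINKS)
root = make_document("http://www.slashdot.org/", "front page text", link_db)
sub = make_document("http://www.slashdot.org/soem_dir/somepage.asp", "sub page text", link_db)
```

With these sample links the root page collects three inlinks whose anchors mention "slashdot", so it ends up with both a matching anchors field and a higher boost than the sub-page. In a real index the boost and the anchors field would be applied when the document is added.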
Doug
Chris Fraschetti wrote:
My Lucene implementation works great; it's basically an index of many web crawls. The main thing my users complain about is that a search for "slashdot" will return http://www.slashdot.org/soem_dir/somepage.asp as the top result, because the scoring factors I use rank it that way. But in true search engine fashion, I would like http://www.slashdot.org/ to be the very top result. I've added a boost to queries that match the hostname field, which helped a little, but that is obviously not a proper solution. Does anyone out there in the search engine world have a good scheme for determining root websites and applying a huge boost to them in one fashion or another, mainly so that the root appears before any sub-pages (assuming the query is in reference to that site)? ...