PageRank is a trademark of Google, and a source of great revenue for them -
you'll have to call it something else. :(
Determining whether a page is relevant to a topic (with any degree of
accuracy) is a harder problem than it may appear - though your opening post
says to assume you have a way to do it. For example, suppose the word
"buffalo" appears on a page. Does this mean the animal, the NFL team,
the city, or the spicy sauce?
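To make the "buffalo" point concrete, here is a minimal sketch comparing a bare
keyword test with a crude context-overlap score. The sample sentences, the
topic_terms vocabulary and both function names are made up for illustration;
a real relevance detector would need far more than this.

```python
import re

def tokens(text):
    # Lowercase the text and split it into alphabetic words.
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_match(text, keyword):
    # Naive relevance test: does the keyword occur at all?
    return keyword in tokens(text)

def context_score(text, topic_terms):
    # Count how many (hypothetical) topic terms co-occur in the text.
    return len(tokens(text) & topic_terms)

# Illustrative pages, one per sense of the word.
pages = {
    "animal": "A buffalo herd grazes on the prairie grass",
    "nfl": "Buffalo won the football game in overtime",
    "city": "Buffalo is a city in western New York",
    "sauce": "Toss the wings in spicy buffalo sauce",
}

# Assumed vocabulary for the "animal" topic.
topic_terms = {"herd", "grazes", "prairie", "bison", "horns"}

for name, text in pages.items():
    print(name, keyword_match(text, "buffalo"), context_score(text, topic_terms))
```

Every page passes the keyword test, but only the animal page shares any words
with the topic vocabulary, which is the kind of extra signal a detector needs.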
Another concern is the assumption that documents linked to by relevant
documents are themselves likely to be relevant. Take Wikipedia for example -
every page carries lots of links that have nothing to do with the
article (such as Main Page, Community Portal, Privacy Policy, etc). If N is
any more than 1 or 2, you'll probably be swamped with non-relevant pages.
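For what it's worth, the N-hop rule from your constraint (2) can be sketched
as a breadth-first crawl whose hop counter resets to zero whenever a relevant
page is found. This is only a sketch over an in-memory link graph, not Nutch
code: focused_crawl, get_links, is_relevant and max_hops are hypothetical
names, and the seeds are assumed to start at distance 0.

```python
from collections import deque

def focused_crawl(seeds, get_links, is_relevant, max_hops):
    """Return {url: hops from the nearest relevant page} for every kept page.

    Relevant pages are kept at distance 0. Non-relevant pages are kept only
    while they are at most max_hops links away from a relevant page.
    """
    best = {}  # smallest distance found so far for each kept page
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, hops = queue.popleft()
        if is_relevant(url):
            hops = 0  # a relevant page restarts the hop budget
        if hops > max_hops:
            continue  # non-relevant and too far away: drop it
        if url in best and best[url] <= hops:
            continue  # already reached by an equal or shorter path
        best[url] = hops
        for link in get_links(url):
            queue.append((link, hops + 1))
    return best

# Toy link graph: one topical chain and one chain of navigation pages.
graph = {
    "seed": ["a", "nav"],
    "a": ["b"], "b": ["c"], "c": ["d"], "d": [],
    "nav": ["privacy"], "privacy": ["terms"], "terms": [],
}
relevant = {"seed", "c"}

kept = focused_crawl(["seed"], lambda u: graph[u], lambda u: u in relevant, 2)
print(sorted(kept.items()))
```

With max_hops=2 the crawl gets through "a" and "b" to the relevant page "c",
but it also keeps "nav" and "privacy". With max_hops=1 the navigation noise
shrinks, and "c" is never reached at all. That is the trade-off in choosing N.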
In researching the problem, you might want to check out the Carrot
Clustering Engine (http://demo.carrot-search.com/carrot2-webapp/main). It
may do what you want OOTB.
-- Jim
On 10/3/06, Apache Lucene <[EMAIL PROTECTED]> wrote:
I would like to set up a focussed crawler using Nutch. Assuming I have a way
to detect which page is relevant to the topic under consideration, what is
the best architecture? Here are the constraints for the crawler.
(1) The first crawl should result in links/pages which are related to the
domain.
(2) The non-relevant documents in the database should be within N hops from
the relevant document. The assumption is a non-relevant document closer to
the relevant node in the web graph might contain links to other relevant
documents.
(1) should be fairly straightforward; however, I am not sure what it involves
to implement (2). I am also not sure how the PageRank will work in case of a
focussed crawler.
Any comments? suggestions?
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general