Hi. I'm interested in these questions as well. For me, the main question is how to focus a crawl on a given topic while still drawing on the entire web. If it's hard-focused to a fixed set of sites, I'll miss too much new content; if it's open to the whole web, it's hard to maintain scope. Thanks, Joe
-----Original Message-----
From: Suhail Ahmed [mailto:[EMAIL PROTECTED]]
Sent: Saturday, May 14, 2005 5:39 AM
To: [email protected]
Subject: topic based crawling

Hi,

I am trying to figure out how I can use Nutch to build something like news.google.com, but only to monitor news about the countries on the State Department's "state sponsors of terrorism" list (http://www.state.gov/s/ct/rls/pgtrpt/2003/31644.htm). I have a URL file with some 2000 online newspapers. I would like to confirm the soundness of my approach, which I suspect is wrong.

My first fetch gets the home pages of the newspapers. I have modified org.apache.nutch.parse.html.HtmlParser to store only those outlinks that contain one of a simple list of nouns related to my topic. Is this right? I am assuming that the second fetch then retrieves the actual stories behind the home-page links. It "sort of" works: unlike news.google, a search on, say, "North Korea" returns both home pages and, sometimes, the article page itself, whereas news.google just displays a list of hyperlinks to the actual news articles. How would I get Nutch search to return only the results of the second crawl and not the first?

Naturally, the second problem is one of categorizing the actual content. Which parts of Nutch or Lucene do I have to work with to categorize (analyze?) the results of the second fetch?

The third bit is how to determine the timestamp on the fetched content so I can display the time of publication as news.google does.

I promise to write up any help provided on the Nutch Wiki so others will know how as well. Thanks a lot.

Suhail
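The outlink-filtering idea described above boils down to something like the following minimal sketch. The Outlink class here is a self-contained stand-in for org.apache.nutch.parse.Outlink, and the noun list is assumed to be loaded from configuration; this illustrates the filtering logic only, not the actual Nutch parser code:

    import java.util.ArrayList;
    import java.util.List;

    // Stand-in for org.apache.nutch.parse.Outlink: a target URL plus its anchor text.
    class Outlink {
        final String toUrl;
        final String anchor;
        Outlink(String toUrl, String anchor) { this.toUrl = toUrl; this.anchor = anchor; }
    }

    class TopicOutlinkFilter {
        private final List<String> topicNouns; // e.g. "korea", "pyongyang", "sanctions"

        TopicOutlinkFilter(List<String> topicNouns) { this.topicNouns = topicNouns; }

        // Keep only outlinks whose anchor text or URL mentions one of the topic nouns.
        List<Outlink> filter(List<Outlink> outlinks) {
            List<Outlink> kept = new ArrayList<Outlink>();
            for (Outlink link : outlinks) {
                String haystack = (link.anchor + " " + link.toUrl).toLowerCase();
                for (String noun : topicNouns) {
                    if (haystack.indexOf(noun.toLowerCase()) >= 0) {
                        kept.add(link);
                        break;
                    }
                }
            }
            return kept;
        }
    }

Matching against anchor text as well as the URL matters here: newspaper article URLs are often opaque IDs, so the anchor is frequently the only place the topic nouns appear.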
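For keeping the first-crawl home pages out of the search results, one simple option, assuming the 2000 seed URLs sit one per line in the URL file, is to post-filter hits against the seed set; another is to tag documents with a crawl-depth field at index time and restrict queries to it. A sketch of the post-filter (the file format and URL normalization are assumptions):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    class SeedFilter {
        private final Set<String> seeds = new HashSet<String>();

        // Load the seed file (one newspaper homepage URL per line).
        SeedFilter(String seedFile) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(seedFile));
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.length() > 0) seeds.add(normalize(line));
            }
            in.close();
        }

        // Drop a trailing slash so "http://a.com" and "http://a.com/" compare equal.
        private String normalize(String url) {
            return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
        }

        // True for article pages, false for the seed home pages themselves.
        boolean isArticle(String url) {
            return !seeds.contains(normalize(url));
        }
    }

The depth-field approach is cleaner long-term, since it also excludes section-index pages discovered below the home page, but the post-filter needs no index changes.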
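For categorizing the second-fetch content by country, a plain term-counting pass over the parsed page text is often enough before reaching for anything heavier in Nutch or Lucene. A sketch, assuming a hand-built keyword set per country on the list (the keywords shown are illustrative, not exhaustive):

    import java.util.HashMap;
    import java.util.Map;

    class CountryClassifier {
        // country -> lowercase keywords; entries here are illustrative only.
        private final Map<String, String[]> lexicon = new HashMap<String, String[]>();

        CountryClassifier() {
            lexicon.put("North Korea", new String[] {"north korea", "pyongyang", "dprk"});
            lexicon.put("Iran", new String[] {"iran", "tehran"});
            // ... one entry per country on the State Department list
        }

        // Return the country whose keywords occur most often, or null if none occur.
        String classify(String pageText) {
            String text = pageText.toLowerCase();
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, String[]> entry : lexicon.entrySet()) {
                int count = 0;
                for (String kw : entry.getValue()) {
                    int idx = 0;
                    while ((idx = text.indexOf(kw, idx)) >= 0) {
                        count++;
                        idx += kw.length();
                    }
                }
                if (count > bestCount) {
                    bestCount = count;
                    best = entry.getKey();
                }
            }
            return best;
        }
    }

The resulting label can then be stored as an extra field on the indexed document so searches can be restricted by country.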
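For the publication timestamp, the cheapest reasonably reliable source is the Last-Modified HTTP header captured during the fetch, with the fetch time itself as a fallback. A sketch of parsing the RFC 1123 format that header uses (how the header string reaches this method depends on the fetcher and is assumed here):

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    import java.util.TimeZone;

    class PublicationDate {
        // RFC 1123 date format used by Last-Modified,
        // e.g. "Sat, 14 May 2005 05:39:00 GMT".
        // Note: SimpleDateFormat is not thread-safe; fine for a sketch.
        private static final SimpleDateFormat RFC1123 =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
        static { RFC1123.setTimeZone(TimeZone.getTimeZone("GMT")); }

        // Prefer the server's Last-Modified; fall back to the fetch time.
        static Date resolve(String lastModifiedHeader, Date fetchTime) {
            if (lastModifiedHeader != null) {
                try {
                    return RFC1123.parse(lastModifiedHeader);
                } catch (ParseException ignored) {
                    // Malformed header: fall through to fetch time.
                }
            }
            return fetchTime;
        }
    }

Many news pages also embed a date in the HTML body or in the URL path, which tends to be closer to the true publication time than Last-Modified; that extraction is site-specific, though.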
