Dennis Kubes
Wed, 02 Jul 2008 15:13:47 -0700
I am almost finished with a new scoring framework that creates a webgraphdb. Part of that process is storing outlinks. It stores only the URL, though, not the anchor text. It would be fairly simple to write a tool that takes that output and processes it by domain and subdomain. Let me know if you want the current patch and I will send it to you.
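As a rough illustration of the kind of post-processing tool described above, the sketch below counts outlink URLs per host (subdomain) and per domain. It assumes the outlinks are available as plain URL strings, which is an assumption and not the actual webgraphdb output format. The class name OutlinkDomainCounter and the "last two labels" domain rule are hypothetical simplifications (the rule mishandles suffixes such as co.uk).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch: counts outlinks per host and per domain.
final class OutlinkDomainCounter {

    // Naive domain rule (assumption): keep only the last two labels of the host.
    static String domainOf(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    // Returns the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Counts outlinks per full host name, e.g. "escert.upc.edu".
    static Map<String, Integer> countByHost(List<String> urls) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : urls) {
            String host = hostOf(url);
            if (host != null) {
                counts.merge(host, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Counts outlinks per domain, folding all subdomains together.
    static Map<String, Integer> countByDomain(List<String> urls) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String url : urls) {
            String host = hostOf(url);
            if (host != null) {
                counts.merge(domainOf(host), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Unparseable URLs are skipped silently here; a real tool would probably want to log them.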
Dennis

brainstorm wrote:
Hi,

Is there a configurable setting to limit the number of links to fetch for every domain in nutch? I'm not referring to the topN setting which sorts a fetchlist using lucene (or nutch) scoring mechanism.

In other words, I just want to fetch 1500 links from every "whatever.com" domain *including* all their subdomains. Ex:

upc.edu: 1500 links hard limit counter
escert.upc.edu: *new* 1500 links hard limit counter
ac.upc.edu: *new* 1500 links hard limit counter

*or* shared links hard limit per domain:

*.upc.edu: 1500 links hard limit

Thanks in advance!
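
To make the two counting modes in the question concrete, here is a minimal sketch of a hard-limit counter that keys either on the full host (a new counter per subdomain) or on the domain (one counter shared across *.upc.edu). It is not an existing Nutch setting or API: the class FetchLimiter, its constructor flag, and the naive "last two labels" domain rule are all hypothetical and only illustrate the behaviour being asked for.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not a Nutch API: enforces a hard cap on accepted URLs per key.
final class FetchLimiter {
    private final int limit;
    private final boolean shareAcrossSubdomains;
    private final Map<String, Integer> counts = new HashMap<>();

    // shareAcrossSubdomains = true  -> one counter for the whole domain (*.upc.edu)
    // shareAcrossSubdomains = false -> a separate counter for each host
    FetchLimiter(int limit, boolean shareAcrossSubdomains) {
        this.limit = limit;
        this.shareAcrossSubdomains = shareAcrossSubdomains;
    }

    // Returns true and counts the URL if its key is still under the limit;
    // returns false for unparseable URLs or once the limit is reached.
    boolean accept(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return false;
        }
        if (host == null) {
            return false;
        }
        host = host.toLowerCase(Locale.ROOT);
        String key = shareAcrossSubdomains ? domainOf(host) : host;
        int seen = counts.getOrDefault(key, 0);
        if (seen >= limit) {
            return false;
        }
        counts.put(key, seen + 1);
        return true;
    }

    // Naive domain rule (assumption): keep only the last two labels of the host.
    private static String domainOf(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```

With new FetchLimiter(1500, false), upc.edu, escert.upc.edu and ac.upc.edu each get their own 1500-link budget; with new FetchLimiter(1500, true), they share a single one.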