Thanks for your support, Dennis :) I've been told that db.max.outlinks.per.page is precisely what we want, so there's no need for the other option right now.
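(For anyone following along: that property lives in Nutch's conf/nutch-site.xml, overriding the default from nutch-default.xml. A minimal sketch; the value shown is illustrative, not a recommendation:)

```xml
<!-- conf/nutch-site.xml: cap how many outlinks are parsed per page.
     The value 100 below is illustrative only. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
</property>
```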
Roman

On Thu, Jul 3, 2008 at 12:13 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> The db.max.outlinks.per.page setting defines how many links are grabbed
> from a given url. There isn't a way to grab by domain, including
> subdomains, without writing a custom tool.
>
> I am almost finished with a new scoring framework that creates a
> webgraphdb. Part of that process is storing outlinks. It stores only the
> url, though, not the anchor text. It would be fairly simple to write a
> tool that takes that output and processes it by domain and subdomain.
> Let me know if you want a current patch and I will send it to you.
>
> Dennis
>
> brainstorm wrote:
>>
>> Hi,
>>
>> Is there a configurable setting to limit the number of links to fetch
>> for every domain in Nutch? I'm not referring to the topN setting,
>> which sorts a fetchlist using the Lucene (or Nutch) scoring mechanism.
>>
>> In other words, I just want to fetch 1500 links from every
>> "whatever.com" domain, *including* all their subdomains.
>>
>> Ex:
>>
>> upc.edu: 1500-link hard-limit counter
>> escert.upc.edu: *new* 1500-link hard-limit counter
>> ac.upc.edu: *new* 1500-link hard-limit counter
>>
>> *or* a shared hard limit per domain:
>>
>> *.upc.edu: 1500-link hard limit
>>
>> Thanks in advance!
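(Editor's note: the custom post-processing tool Dennis describes — taking a list of outlink URLs and enforcing a shared per-domain cap across subdomains, as in the "*.upc.edu: 1500" case — could be sketched roughly like this. The function names and the naive last-two-labels domain heuristic are hypothetical, not part of Nutch; a real tool should use a public-suffix list to handle domains like example.co.uk.)

```python
from collections import defaultdict
from urllib.parse import urlparse

def registered_domain(url):
    """Naive registered-domain heuristic: keep the last two host labels,
    so escert.upc.edu and ac.upc.edu both map to upc.edu.
    (Hypothetical helper; a real tool should consult a public-suffix
    list, since this breaks for suffixes like .co.uk.)"""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

def cap_per_domain(urls, limit=1500):
    """Keep at most `limit` URLs per registered domain, so all
    subdomains share one counter (the *.upc.edu case in the thread)."""
    counts = defaultdict(int)
    kept = []
    for url in urls:
        dom = registered_domain(url)
        if counts[dom] < limit:
            counts[dom] += 1
            kept.append(url)
    return kept
```

Running the CrawlDb or webgraphdb outlink dump through something like this before generating a fetchlist would give the shared hard limit the original question asks for.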
