The db.max.outlinks.per.page setting defines how many outlinks are
collected from a given URL. There isn't a way to limit by domain,
including subdomains, without writing a custom tool.
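If you just want to change that per-page cap, it goes in
conf/nutch-site.xml. Something like the following should do it (200 is
just an example value); the default in nutch-default.xml is 100, and if
I remember right a negative value means no limit:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>200</value>
  <description>Maximum number of outlinks processed per page.
  A negative value means no limit.</description>
</property>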
I am almost finished with a new scoring framework that builds a
webgraphdb. Part of that process is storing outlinks, though it stores
only the URL, not the anchor text. It would be fairly simple to write a
tool that takes that output and rolls it up by domain and subdomain (a
rough sketch is below). Let me know if you want the current patch and I
will send it to you.
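In case it helps, here is a rough, untested sketch of the kind of
post-processing tool I mean. It assumes a plain text file of URLs, one
per line (roughly what an outlink dump looks like), and it treats the
last two host labels as the "domain", so subdomains share one counter;
that heuristic will get hosts like example.co.uk wrong unless you add a
public suffix list. The class name and the 1500 default are just
placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class DomainLimiter {

  // Naive "registered domain" guess: the last two host labels.
  // A real tool would use a public suffix list for cases like example.co.uk.
  static String domainOf(String host) {
    String[] labels = host.split("\\.");
    if (labels.length <= 2) {
      return host;
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 1) {
      System.err.println("Usage: DomainLimiter <url-file> [limit]");
      return;
    }
    int limit = args.length > 1 ? Integer.parseInt(args[1]) : 1500;

    Map<String, Integer> counts = new HashMap<String, Integer>();
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() == 0) {
          continue;
        }
        String domain;
        try {
          domain = domainOf(new URL(line).getHost().toLowerCase());
        } catch (MalformedURLException e) {
          continue; // skip anything that is not a valid URL
        }
        int seen = counts.containsKey(domain) ? counts.get(domain) : 0;
        if (seen < limit) {
          counts.put(domain, seen + 1);
          System.out.println(line); // keep this URL, still under the cap
        }
      }
    } finally {
      in.close();
    }
  }
}

You would run it over the dump and use the surviving URLs as your
filter/seed list; if you want a separate counter per full host (your
first variant), just drop the domainOf() call and count on the host
itself.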
Dennis
brainstorm wrote:
Hi,
Is there a configurable setting to limit the number of links fetched
for each domain in Nutch? I'm not referring to the topN setting,
which sorts a fetchlist using the Lucene (or Nutch) scoring mechanism.
In other words, I just want to fetch 1500 links from every
"whatever.com" domain, *including* all of its subdomains.
Ex:
upc.edu: 1500-link hard limit counter
escert.upc.edu: *new* 1500-link hard limit counter
ac.upc.edu: *new* 1500-link hard limit counter
*or* a shared hard limit per domain:
*.upc.edu: 1500-link hard limit
Thanks in advance!