I'm new to Nutch and these mailing lists, so I apologize in advance if I'm mailing the wrong list. However, I'm suggesting a patch, so I think dev is The Right Place for this conversation.
I'm currently trying to perform a deep crawl within a small list of sites. To accomplish this, I updated conf/domain-urlfilter.txt with the domains of the sites I wish to scrape, which worked nicely. However, I found that not only were the URLs crawled at each step filtered, but the outlinks captured from each crawled page were filtered as well. I'm trying to perform some simple analysis on the domains I'm scraping, and outlinks are one of the things I'd like to analyze. Is there a way to avoid filtering captured outlinks while still filtering crawled URLs?

The way I ended up solving the problem was by making a small code change to Nutch 2.x. Specifically, I introduced the following new property in conf/nutch-default.xml:

<property>
  <name>parser.html.outlinks.filter</name>
  <value>true</value>
  <description>By default, outlinks are filtered at the time documents are
  parsed, so only filtered outlinks are indexed. This option allows you to
  disable outlink filtering, so all outlinks are indexed. Note that links
  are filtered in the generate step by default, so indexing unfiltered
  outlinks typically will not affect your crawl.</description>
</property>

I then made a small code change to src/java/org/apache/nutch/parse/ParseUtil.java so that outlinks are not filtered when the property is set to false.

Is there a better way to solve this problem? If so, I'd love to use that instead! However, if there isn't a better way, would the Nutch team entertain this code change as a patch? I'd be happy to send it along, if you guys think it would be useful.

Andy Boothe
Director of Data Engineering, Analytics
WCG
60 Francisco St, San Francisco, 94133
direct 512-961-3993
cell 979-574-1089
twitter @sigpwned<http://twitter.com/sigpwned>
linkedin aboothe<http://www.linkedin.com/in/aboothe>
github sigpwned<https://github.com/sigpwned>
stack overflow sigpwned<http://stackoverflow.com/users/2103602/sigpwned>
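
P.S. For the curious, here is a rough sketch of the kind of change I have in mind. This is not the actual patch, just an illustration of the idea; it assumes the outlink filtering in ParseUtil goes through the usual URLFilters chain with a Hadoop Configuration in scope, and the class and method names here are made up for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilterException;
import org.apache.nutch.net.URLFilters;

public class OutlinkFilterSketch {

  /**
   * Returns the outlink URL to keep, or null to drop it.
   * When parser.html.outlinks.filter is false, outlinks bypass the URL
   * filter chain entirely. Crawled URLs are still filtered at generate
   * time, so the crawl itself is unaffected.
   */
  public static String maybeFilterOutlink(Configuration conf,
                                          URLFilters filters,
                                          String toUrl) {
    if (!conf.getBoolean("parser.html.outlinks.filter", true)) {
      return toUrl;                  // filtering disabled: keep every outlink
    }
    try {
      return filters.filter(toUrl);  // existing behavior: null if rejected
    } catch (URLFilterException e) {
      return null;
    }
  }
}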

