Hi Andy,

Such a change seems reasonable. You can open an issue and attach patches there:
https://issues.apache.org/jira/browse/NUTCH

Markus

-----Original message-----
From: Andy Boothe [WCG]<[email protected]>
Sent: Sunday 3rd November 2013 17:03
To: [email protected]
Subject: Best way to avoid filtering outlinks? [PATCH]

I'm new to Nutch and these mailing lists, so I apologize in advance if I'm mailing the wrong list. However, I'm suggesting a patch, so I think dev is The Right Place for this conversation.

I'm currently trying to perform a deep crawl within a small list of sites. To accomplish this, I updated conf/domain-urlfilter.txt with the domains of the sites I wish to scrape, which worked nicely. However, I found that not only were the links crawled at every step filtered, but the outlinks captured from each page crawled were filtered as well. I'm trying to perform some simple analysis on the domains I'm scraping, and outlinks is one of the things I'd like to analyze.

Is there a way to avoid filtering captured outlinks while still filtering crawled URLs?

The way I ended up solving the problem was by making a small code change to Nutch 2.x. Specifically, I introduced the following new property to conf/nutch-default.xml:

    <property>
      <name>parser.html.outlinks.filter</name>
      <value>true</value>
      <description>By default, outlinks are filtered at the time documents
      are parsed, so only filtered outlinks are indexed. This option allows
      you to disable outlink filtering, so all outlinks are indexed. Note
      that links are filtered in the generate step by default, so indexing
      unfiltered outlinks typically will not affect your crawl.</description>
    </property>

And then made a small code change to src/java/org/apache/nutch/parse/ParseUtil.java that doesn't filter outlinks if the property is set to false.

Is there a better way to solve this problem? If so, I'd love to use that instead! However, if there isn't a better way, would the Nutch team entertain this code change as a patch? I'd be happy to send it along, if you guys think it would be useful.

Andy Boothe
Director of Data Engineering, Analytics
WCG
60 Francisco St, San Francisco, 94133
direct 512-961-3993
cell 979-574-1089
twitter @sigpwned <http://twitter.com/sigpwned>
linkedin aboothe <http://www.linkedin.com/in/aboothe>
github sigpwned <https://github.com/sigpwned>
stack overflow sigpwned <http://stackoverflow.com/users/2103602/sigpwned>
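
For illustration only, the kind of change described in the quoted message (skip outlink filtering when parser.html.outlinks.filter is false) could look roughly like the sketch below. This is not the actual patch and does not reproduce the real ParseUtil code. The class OutlinkFilterSketch, the method selectOutlinks, the Map-based configuration, and the Predicate-based URL filter are all hypothetical stand-ins chosen so the example runs on its own. Only the property name and its default of true come from the message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch; not the actual Nutch ParseUtil code or the proposed patch. */
public class OutlinkFilterSketch {

    /** Property name taken from the quoted message. */
    static final String FILTER_KEY = "parser.html.outlinks.filter";

    /**
     * Returns the outlinks to store for a parsed page.
     * conf stands in for a real configuration object; urlFilter stands in
     * for whatever URL filters are configured (true means "keep this URL").
     */
    static List<String> selectOutlinks(List<String> outlinks,
                                       Map<String, String> conf,
                                       Predicate<String> urlFilter) {
        // Default to "true", matching the <value> in the proposed property.
        boolean filter = Boolean.parseBoolean(conf.getOrDefault(FILTER_KEY, "true"));
        if (!filter) {
            // Filtering disabled: keep every captured outlink.
            return new ArrayList<>(outlinks);
        }
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (urlFilter.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = List.of("http://example.com/a", "http://other.org/b");
        // Assumed filter: keep only URLs on the crawled domain.
        Predicate<String> sameDomain = u -> u.contains("example.com");

        // Property unset: outlinks are filtered.
        System.out.println(selectOutlinks(outlinks, Map.of(), sameDomain));
        // Property set to false: outlinks are stored unfiltered.
        System.out.println(selectOutlinks(outlinks, Map.of(FILTER_KEY, "false"), sameDomain));
    }
}
```

In this sketch the default path behaves as before (only URLs accepted by the filter are kept), and setting the property to false returns every captured outlink. Whether the crawl itself stays restricted would still depend on filtering in the generate step, as the property description notes.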

