I'm new to Nutch and these mailing lists, so I apologize in advance if I'm 
mailing the wrong list. However, I'm suggesting a patch, so I think dev is The 
Right Place for this conversation.


I'm currently trying to perform a deep crawl within a small list of sites. To 
accomplish this, I updated conf/domain-urlfilter.txt with the domains of the 
sites I wish to scrape, which worked nicely. However, I found that not only 
were the links crawled at every step filtered, but the outlinks captured from 
each page crawled were filtered as well.
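
For reference, that filter file is just a list of the hosts/domains to allow, 
one per line. Mine looks roughly like this (domains made up for illustration):

example.com
news.example.org
blog.example.net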

I'm trying to perform some simple analysis on the domains I'm scraping, and 
outlinks is one of the things I'd like to analyze. Is there a way to avoid 
filtering captured outlinks while still filtering crawled URLs?

The way I ended up solving the problem was by making a small code change to 
Nutch 2.x. Specifically, I introduced the following new property to 
conf/nutch-default.xml:

<property>
  <name>parser.html.outlinks.filter</name>
  <value>true</value>
  <description>By default, outlinks are filtered at the time documents
  are parsed, so only outlinks that pass the URL filters are indexed.
  Setting this to false disables outlink filtering at parse time, so all
  outlinks are kept. Note that URLs are still filtered in the generate
  step by default, so keeping unfiltered outlinks typically will not
  affect your crawl.</description>
</property>
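
To actually turn the filtering off, the idea is to override this to false in 
conf/nutch-site.xml, the usual place for local overrides:

<property>
  <name>parser.html.outlinks.filter</name>
  <value>false</value>
</property>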

I then made a small code change to 
src/java/org/apache/nutch/parse/ParseUtil.java so that outlinks are not 
filtered when the property is set to false.
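
To give a sense of the shape of the change, here's a rough sketch. This is not 
the patch itself, and the class and method names below are made up for the 
example; only Configuration, URLFilters, URLNormalizers, and Outlink are the 
standard Hadoop/Nutch classes:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.parse.Outlink;

/** Sketch of parse-time outlink handling that honors the new property. */
public class OutlinkFilterSketch {

  /**
   * Normalizes every outlink, but only applies the URL filters when
   * parser.html.outlinks.filter is true (the default).
   */
  public static List<Outlink> processOutlinks(Configuration conf, Outlink[] outlinks)
      throws Exception {
    boolean filterOutlinks = conf.getBoolean("parser.html.outlinks.filter", true);
    URLFilters filters = new URLFilters(conf);
    URLNormalizers normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_OUTLINK);

    List<Outlink> kept = new ArrayList<Outlink>();
    for (Outlink outlink : outlinks) {
      // Normalization still happens unconditionally.
      String toUrl = normalizers.normalize(outlink.getToUrl(), URLNormalizers.SCOPE_OUTLINK);
      if (toUrl != null && filterOutlinks) {
        // URLFilters.filter returns null for URLs rejected by any filter.
        toUrl = filters.filter(toUrl);
      }
      if (toUrl != null) {
        kept.add(new Outlink(toUrl, outlink.getAnchor()));
      }
    }
    return kept;
  }
}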

Is there a better way to solve this problem? If so, I'd love to use that 
instead! However, if there isn't a better way, would the Nutch team entertain 
this code change as a patch? I'd be happy to send it along, if you guys think 
it would be useful.


Andy Boothe

Director of Data Engineering, Analytics


WCG

60 Francisco St, San Francisco, 94133

direct 512-961-3993 cell 979-574-1089

twitter @sigpwned<http://twitter.com/sigpwned> linkedin 
aboothe<http://www.linkedin.com/in/aboothe>

github sigpwned<https://github.com/sigpwned> stack overflow 
sigpwned<http://stackoverflow.com/users/2103602/sigpwned>

