Hi Andy,

Such a change seems reasonable. You can open an issue and attach patches there.
https://issues.apache.org/jira/browse/NUTCH

Markus

-----Original message-----
From: Andy Boothe [WCG]<[email protected]>
Sent: Sunday 3rd November 2013 17:03
To: [email protected]
Subject: Best way to avoid filtering outlinks? [PATCH]

I'm new to Nutch and these mailing lists, so I apologize in advance if I'm 
mailing the wrong list. However, I'm suggesting a patch, so I think dev is The 
Right Place for this conversation.

I'm currently trying to perform a deep crawl within a small list of sites. To 
accomplish this, I updated conf/domain-urlfilter.txt with the domains of the 
sites I wish to scrape, which worked nicely. However, I found that not only 
were the links crawled at every step filtered, but the outlinks captured from 
each page crawled were filtered as well.

I'm trying to perform some simple analysis on the domains I'm scraping, and 
outlinks are one of the things I'd like to analyze. Is there a way to avoid 
filtering captured outlinks while still filtering crawled URLs?

The way I ended up solving the problem was by making a small code change to 
Nutch 2.x. Specifically, I introduced the following new property to 
conf/nutch-default.xml:

<property>
 <name>parser.html.outlinks.filter</name>
 <value>true</value>
 <description>By default, outlinks are filtered at the time documents
 are parsed, so only filtered outlinks are indexed. This option allows
 you to disable outlink filtering, so all outlinks are indexed. Note
 that links are filtered in the generate step by default, so indexing
 unfiltered outlinks typically will not affect your crawl.</description>
</property>

I then made a small code change to 
src/java/org/apache/nutch/parse/ParseUtil.java so that outlinks are not 
filtered when the property is set to false.
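
For readers following the thread, the idea can be sketched roughly as below. This is only an illustration of the conditional-filtering logic, not the actual ParseUtil.java change: the class OutlinkFilterSketch, the method selectOutlinks, and the use of a plain Predicate and Properties object in place of Nutch's URL filters and configuration are all assumptions made for the example. Only the property name comes from the message above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.function.Predicate;

// Illustrative sketch only; this is NOT the real ParseUtil.java patch.
final class OutlinkFilterSketch {

    // Property name taken from the proposed nutch-default.xml entry.
    static final String FILTER_PROPERTY = "parser.html.outlinks.filter";

    // Returns the outlinks that would be stored for a parsed page.
    // urlFilter is a hypothetical stand-in for the configured URL filters;
    // it returns true for URLs that pass the filters.
    static List<String> selectOutlinks(List<String> outlinks,
                                       Predicate<String> urlFilter,
                                       Properties conf) {
        // Default to "true" so existing behaviour is unchanged when unset.
        boolean filter =
            Boolean.parseBoolean(conf.getProperty(FILTER_PROPERTY, "true"));
        if (!filter) {
            // Filtering disabled: keep every captured outlink.
            return new ArrayList<>(outlinks);
        }
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (urlFilter.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }
}
```

Defaulting the property to true keeps current behaviour for anyone who does not set it, which matches the default value in the proposed property above.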

Is there a better way to solve this problem? If so, I'd love to use that 
instead! However, if there isn't a better way, would the Nutch team entertain 
this code change as a patch? I'd be happy to send it along, if you guys think 
it would be useful.

Andy Boothe

Director of Data Engineering, Analytics

WCG

60 Francisco St, San Francisco, 94133

direct 512-961-3993      cell 979-574-1089

twitter @sigpwned <http://twitter.com/sigpwned>  linkedin aboothe 
<http://www.linkedin.com/in/aboothe>

github sigpwned <https://github.com/sigpwned>            stack overflow sigpwned 
<http://stackoverflow.com/users/2103602/sigpwned>

• • • • • • • Go. Ahead.
