I use the prefix and suffix URL filters instead of the regex URL filter.
You can enable them by making sure the plugin.includes property in your
nutch-*.xml file lists the corresponding urlfilter plugins. Where you
currently have urlfilter-regex, use this instead:
urlfilter-(prefix|suffix)...
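As a rough sketch, here is how the full property might look if you start from the plugin list in your message and swap urlfilter-regex for the prefix and suffix filters. I am assuming the override lives in conf/nutch-site.xml; adjust the rest of the list to whatever you actually load.

```xml
<!-- Assumed location: conf/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <!-- Same list as in the original message, with urlfilter-regex replaced -->
  <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```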
Then you will need the prefix-urlfilter.txt and suffix-urlfilter.txt
files in the conf directory. Below is a configuration that crawls only
URLs starting with http and skips specific suffixes. The suffix filter
starts by allowing everything and then denies the listed file types.
Dennis Kubes
# prefix-urlfilter.txt file starts here
http
# prefix-urlfilter.txt file ends here
# suffix-urlfilter.txt file starts here
# case-insensitive, allow unknown suffixes
+I
# prohibit these
.gif
.jpg
.jpeg
.bmp
.png
.ico
.css
.sit
.eps
.wmf
.zip
.ppt
.mpg
.xls
.gz
.tar
.rpm
.rm
.tgz
.mov
.exe
.vid
.ai
.pdf
.txt
.psd
.js
# suffix-urlfilter.txt file ends here
cybercouf wrote:
[nutch 0.8.1]
I want to crawl only web (HTML) content. In all the xxx-urlfilter.txt files I have:
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|swf|iso|pdf|PDF|js|avi|doc)$
and I load only the plugins I need (I think):
<value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
But when I dump a segment or the linkdb, I can see I have a lot of outlinks
to non-web pages, like:
outlink: toUrl: http://domain.com/images/img.gif anchor:
outlink: toUrl: http://domain.com/images/img.jpg anchor:
outlink: toUrl: http://domain.com/style.css anchor:
Where can I configure Nutch to link only to web pages?
I saw the java code in this function:
DOMContentUtils.setConf(Configuration conf)
...
linkParams.put("img", new LinkParams("img", "src", 0));
Maybe I could just comment out that line, but that does not look like the
right way to do it; a configuration file would be better.