I use the prefix and suffix url filters instead of the regex url filter.

To use the prefix and suffix filters, make sure the plugin.includes property in your nutch-*.xml file lists them in place of urlfilter-regex, which is what you currently have:

urlfilter-(prefix|suffix)...


Then you will need the prefix-urlfilter.txt and suffix-urlfilter.txt files in the conf directory. Below is a configuration that crawls only URLs starting with http and skips certain file types. In the suffix file we start by allowing everything and then deny specific suffixes.

Dennis Kubes

# prefix-urlfilter.txt file starts here
http
# prefix-urlfilter.txt file ends here

# suffix-urlfilter.txt file starts here
# case-insensitive, allow unknown suffixes
+I
# prohibit these
.gif
.jpg
.jpeg
.bmp
.png
.ico
.css
.sit
.eps
.wmf
.zip
.ppt
.mpg
.xls
.gz
.tar
.rpm
.rm
.tgz
.mov
.exe
.vid
.ai
.pdf
.txt
.psd
.css
.js
# suffix-urlfilter.txt file ends here
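
To illustrate how the two files above combine, here is a minimal Java sketch of the decision they describe: a URL must start with an allowed prefix and must not end with a denied suffix, compared case-insensitively. This is not Nutch's plugin code. The class name UrlFilterSketch, the accept method, and the shortened suffix list are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; the real urlfilter-prefix and
// urlfilter-suffix plugins are implemented differently.
class UrlFilterSketch {
    // From prefix-urlfilter.txt: a URL must start with one of these.
    static final List<String> ALLOWED_PREFIXES = Arrays.asList("http");

    // A shortened subset of the suffixes denied in suffix-urlfilter.txt.
    static final List<String> DENIED_SUFFIXES =
            Arrays.asList(".gif", ".jpg", ".png", ".css", ".pdf", ".js");

    // Returns true if the URL passes both the prefix and the suffix check.
    static boolean accept(String url) {
        boolean prefixOk = false;
        for (String prefix : ALLOWED_PREFIXES) {
            if (url.startsWith(prefix)) {
                prefixOk = true;
                break;
            }
        }
        if (!prefixOk) {
            return false;
        }
        // Lower-case the URL so the suffix comparison ignores case,
        // mirroring the "I" flag; unknown suffixes are allowed ("+").
        String lower = url.toLowerCase(Locale.ROOT);
        for (String suffix : DENIED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://domain.com/index.html",
                "http://domain.com/images/img.GIF",
                "ftp://domain.com/index.html");
        for (String url : urls) {
            System.out.println(url + " -> " + (accept(url) ? "accept" : "reject"));
        }
    }
}
```

Note that in this sketch the prefix "http" also lets https URLs through, since it is a plain string-prefix comparison.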

cybercouf wrote:
[nutch 0.8.1]

I want to crawl only HTML web content. In all my xxx-urlfilter.txt files I have:
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|swf|iso|pdf|PDF|js|avi|doc)$

and I load only the plugins I need (I think)
<value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>

But when I dump a segment or the linkdb, I can see I have a lot of outlinks
to non-web pages, like:

outlink: toUrl: http://domain.com/images/img.gif anchor:
outlink: toUrl: http://domain.com/images/img.jpg anchor:
outlink: toUrl: http://domain.com/style.css anchor:

Where can I configure nutch to link only to web pages?
I saw the java code in this function:

DOMContentUtils.setConf(Configuration conf)
...
linkParams.put("img", new LinkParams("img", "src", 0));

Maybe I can just comment out that line, but it does not look like the right way to do it;
a configuration file would be better.
