In this case a suffix filter would be better.

You can enable the suffix filter by making sure the plugin.includes
property in the nutch-*.xml file lists the suffix urlfilter plugin
among its URL filters, like so:

urlfilter-(suffix)...

Then you will need a suffix-urlfilter.txt file in the conf directory.
Below is a configuration that filters pages by suffix: it starts by
allowing everything and then specifically denies certain file types.

Dennis

# suffix-urlfilter.txt file starts here
# case-insensitive, allow unknown suffixes
+I
# prohibit these
.gif
.jpg
.jpeg
.bmp
.png
.ico
.css
.sit
.eps
.wmf
.zip
.ppt
.mpg
.xls
.gz
.tar
.rpm
.rm
.tgz
.mov
.exe
.vid
.ai
.pdf
.txt
.psd
# suffix-urlfilter.txt file ends here
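
To make the allow/deny logic above concrete, here is a minimal Python
sketch of how such a rule file could be interpreted. It is not Nutch's
actual implementation. The function names parse_rules and accepts are
hypothetical, matching against the URL path only is a simplifying
choice, and the "-" mode (reject unknown suffixes, accept only the
listed ones) reflects my understanding of the plugin rather than
anything stated in this message. Under those assumptions, a "-I" file
listing .php and .htm would behave like the allow-only case asked
about in the original question.

```python
from urllib.parse import urlparse


def parse_rules(text):
    """Parse suffix-urlfilter.txt-style text.

    Returns (allow_unknown, ignore_case, suffixes).
    """
    allow_unknown = False  # assumed default when no +/- line is present
    ignore_case = False
    suffixes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] in "+-":
            # Mode line such as "+", "-", "+I" or "-I"
            allow_unknown = line[0] == "+"
            ignore_case = "I" in line[1:].upper()
            continue
        suffixes.append(line)
    if ignore_case:
        suffixes = [s.lower() for s in suffixes]
    return allow_unknown, ignore_case, suffixes


def accepts(url, rules):
    """Return True if the URL passes the filter described by rules."""
    allow_unknown, ignore_case, suffixes = rules
    path = urlparse(url).path  # simplification: ignore query and fragment
    if ignore_case:
        path = path.lower()
    listed = any(path.endswith(s) for s in suffixes)
    # A listed suffix flips the default: denied in "+" mode, allowed in "-" mode.
    return listed != allow_unknown


# Deny-list mode, a shortened version of the file above.
DENY_LIST = "# allow unknown suffixes\n+I\n.gif\n.pdf\n"
# Assumed allow-only mode for the .php/.htm case.
ALLOW_ONLY = "# reject unknown suffixes\n-I\n.php\n.htm\n"

deny_rules = parse_rules(DENY_LIST)
allow_rules = parse_rules(ALLOW_ONLY)
```

With these rule sets, accepts("http://example.com/index.php", deny_rules)
passes while a URL ending in .GIF is rejected, and allow_rules passes
only URLs whose path ends in .php or .htm.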

Tobias Zahn wrote:
> Good evening everybody!
> I have searched Google, the FAQs and so on, but I didn't find anything
> on how to get only some types of files indexed (e.g. every file ending
> in .php and .htm). Is there a way to do this?
> 
> It would also be helpful for me if it were possible to get a list of
> all indexed URLs of these file types.
> 
> TIA,
> Tobias Zahn

Nutch-general mailing list