In this case a suffix filter would be better. To enable it, make sure the plugin.includes property in the nutch-*.xml file lists the suffix URL filter in its urlfilter entry, like so:

urlfilter-(suffix)...

You will then need a suffix-urlfilter.txt file in the conf directory. The configuration below starts by allowing everything and then specifically denies certain file types, so every URL is crawled except those ending in one of the listed suffixes.

Dennis

# suffix-urlfilter.txt file starts here
# case-insensitive, allow unknown suffixes
+I
# prohibit these
.gif
.jpg
.jpeg
.bmp
.png
.ico
.css
.sit
.eps
.wmf
.zip
.ppt
.mpg
.xls
.gz
.tar
.rpm
.rm
.tgz
.mov
.exe
.vid
.ai
.pdf
.txt
.psd
# suffix-urlfilter.txt file ends here

Tobias Zahn wrote:
> Good evening everybody!
> I have looked up Google, the FAQs and so on but I didn't find anything
> on how to get only some types of files indexed (e.g. every file ending
> in .php and .htm). Is there a way to do this?
>
> It would also be helpful for me if it were possible to get a list of
> all indexed URLs of these file types.
>
> TIA,
> Tobias Zahn
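To make the allow-then-deny logic of the suffix-urlfilter.txt file in the reply concrete, here is a minimal Python sketch. It is only a simplified model of the behaviour described in this thread, not Nutch's actual code: the function name make_suffix_filter and its parameters are hypothetical, it matches against the end of the whole URL string, and it does not handle query strings. The allow_unknown=False mode (accept only the listed suffixes, which is what the original question asks for) is an assumption about how the inverse of "+I" would behave.

```python
def make_suffix_filter(suffixes, allow_unknown=True, ignore_case=True):
    """Build a URL predicate from a list of suffixes (hypothetical helper).

    allow_unknown=True  models the "+" mode from the reply: every URL passes
                        unless it ends in a listed suffix.
    allow_unknown=False is an assumed inverse mode: only URLs ending in a
                        listed suffix pass.
    """
    normalized = tuple(s.lower() if ignore_case else s for s in suffixes)

    def accept(url):
        candidate = url.lower() if ignore_case else url
        listed = candidate.endswith(normalized)
        # In allow-unknown mode a listed suffix means "deny";
        # in the inverse mode a listed suffix means "allow".
        return (not listed) if allow_unknown else listed

    return accept


# Deny a few of the file types from the configuration, allow the rest.
deny_media = make_suffix_filter([".gif", ".jpg", ".pdf", ".zip"])

# Assumed inverse mode: keep only .php and .htm pages.
only_pages = make_suffix_filter([".php", ".htm"], allow_unknown=False)
```

With these two filters, deny_media("http://example.com/logo.GIF") is rejected because matching is case-insensitive, while only_pages("http://example.com/index.php") is kept.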
