I would like to use Nutch to crawl the URLs listed in a file (say urls/urls.txt) and have it stay within those only. I have been using Nutch for a few weeks now, but it bothers me that the crawler follows the ads on websites and indexes their content. Most of the time, the crawler ends up analysing content from "free ipod", discount-offer and traveltoBananaIsland.com-style sites, which I have no interest in having in the index.
I know that conf/crawl-urlfilter.txt can be used for that purpose, but I was wondering whether there is a single line in a conf file that would turn such a feature on. I would prefer to avoid writing regexps and just feed the crawler plain URLs.
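
For reference, the regexp route I am trying to avoid would look roughly like the sketch below. It turns a list of plain seed URLs into accept rules in the `+regex` / `-regex` style used by conf/crawl-urlfilter.txt, followed by a catch-all reject. The function name `build_filter_rules` and the sample URLs are made up for illustration, and I am assuming that "staying within these" means staying on the same hosts as the seeds.

```python
import re
from urllib.parse import urlparse


def build_filter_rules(seed_urls):
    """Build crawl-urlfilter.txt style rules that accept only the seed hosts.

    Hypothetical helper, not part of Nutch: one '+' rule per distinct host,
    then a final '-.' rule that rejects everything else.
    """
    hosts = []
    for url in seed_urls:
        url = url.strip()
        # Skip blank lines and comment lines in the seed file.
        if not url or url.startswith("#"):
            continue
        host = urlparse(url).hostname  # None when the scheme is missing
        if host and host not in hosts:
            hosts.append(host)

    # Escape the host so that '.' is matched literally, allow an optional port.
    rules = ["+^https?://" + re.escape(host) + "(:[0-9]+)?/" for host in hosts]
    rules.append("-.")  # reject any URL not accepted above
    return rules


# Sample seeds (assumed); in practice these would be the lines of urls/urls.txt.
sample_seeds = [
    "http://www.example.com/index.html",
    "http://news.example.org/",
]
for rule in build_filter_rules(sample_seeds):
    print(rule)
```

The printed lines would then replace the default accept rule in the filter file. This works, but it means regenerating the filter every time the seed list changes, which is exactly the kind of bookkeeping I was hoping a single configuration switch could spare me.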