I would like Nutch to crawl the URLs listed in a file (say
urls/urls.txt) and stay strictly within these. I have been using Nutch
for a few weeks now, and it bothers me that the crawler follows the ads
on websites and indexes their content. Most of the time it ends up
analysing "free iPod", discount and traveltoBananaIsland.com-style
sites, which I have no interest in having in the index.

I know that conf/crawl-urlfilter.txt could be used for that purpose,
but I was wondering whether there is a single line in a conf file that
would turn such a feature on. I would prefer to avoid writing regexps
and just feed the crawler plain URLs.
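
To make the "plain URLs in, filter out" idea concrete, here is a minimal
sketch of what I have in mind if no such switch exists: a small Python
helper that turns the seed URLs into filter lines. The function name
build_filter_rules is made up for illustration, and I am assuming the
filter file takes one regexp per line, prefixed with "+" to accept or
"-" to reject, with the first matching line winning. Seeds on
non-default ports are not handled.

```python
import re
from urllib.parse import urlparse


def build_filter_rules(seed_urls):
    """Build URL-filter lines that accept only the hosts of the seed URLs.

    Hypothetical helper: assumes one regexp per line, '+' to accept,
    '-' to reject, first matching line wins.
    """
    hosts = []
    for url in seed_urls:
        url = url.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comments in the seed file
        host = urlparse(url).hostname
        if host and host not in hosts:
            hosts.append(host)  # keep each host once, in seed order

    # One accept rule per host; re.escape keeps the dots literal.
    rules = ["+^https?://" + re.escape(host) + "/" for host in hosts]
    rules.append("-.")  # reject every URL not accepted above
    return rules


if __name__ == "__main__":
    # Sample seeds; in practice these would be read from urls/urls.txt.
    for rule in build_filter_rules(["http://www.example.com/index.html"]):
        print(rule)
```

The output would still have to be pasted into conf/crawl-urlfilter.txt
by hand, which is exactly the step I was hoping a single config line
could replace.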
