Hi Nutch developers,
Is there any way to write some kind of URL filter that allows only
certain URLs to be fetched? I would like Nutch to follow only the
URLs that I allow, whereas the seed URLs still get analyzed further.
There are already plugins that support URL filtering, and they let you
specify the allowed URLs in a number of different ways. See the
following plugins:
urlfilter-automaton
urlfilter-domain
urlfilter-prefix
urlfilter-regex
urlfilter-suffix
urlfilter-validator
Which one(s) to use depends on your particular goals.
If none of these would work for you, then you can always create a new
plugin that implements the URLFilter interface.
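As a rough sketch of what such a filter could look like: as far as I
recall, URLFilter declares a single method, filter(String urlString),
which returns the URL to accept it or null to reject it. The example
below uses a local stand-in interface with that shape so it compiles
without Nutch on the classpath. The class name AllowListUrlFilter and
the example.com prefixes are made up for illustration, and a real
plugin would also need the Configurable plumbing and a plugin.xml
descriptor, which are left out here.

```java
import java.util.List;

// Stand-in for org.apache.nutch.net.URLFilter (assumption: same method
// shape), declared locally so this sketch compiles without Nutch.
interface URLFilter {
    // Return the URL to accept it, or null to reject it.
    String filter(String urlString);
}

// Hypothetical filter: accepts only URLs that start with an allowed prefix.
class AllowListUrlFilter implements URLFilter {
    private final List<String> allowedPrefixes;

    AllowListUrlFilter(List<String> allowedPrefixes) {
        // Defensive copy so later changes to the caller's list have no effect.
        this.allowedPrefixes = List.copyOf(allowedPrefixes);
    }

    @Override
    public String filter(String urlString) {
        if (urlString == null) {
            return null;
        }
        for (String prefix : allowedPrefixes) {
            if (urlString.startsWith(prefix)) {
                return urlString; // accepted: pass the URL through unchanged
            }
        }
        return null; // rejected: no allowed prefix matched
    }
}

class AllowListDemo {
    public static void main(String[] args) {
        // Example prefixes only; replace with the URLs you want to allow.
        URLFilter filter = new AllowListUrlFilter(
                List.of("http://example.com/docs/", "http://example.com/blog/"));
        System.out.println(filter.filter("http://example.com/docs/intro.html"));
        System.out.println(filter.filter("http://other.example.org/page.html"));
    }
}
```

If plain prefix matching like this is all you need, urlfilter-prefix
probably already covers it; a custom plugin only pays off when the
accept/reject decision needs logic the existing plugins cannot express.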
-- Ken
--
Ken Krugler