I was reading through the FAQ and have a follow-up to one of the
questions there. Here's what the FAQ says:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Is it possible to fetch only pages from some specific domains?
Please have a look on PrefixURLFilter. Adding some regular expressions
to the urlfilter.regex.file might work, but adding a list with thousands
of regular expressions would slow down your system excessively.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I see the urlfilter.prefix.file entry in conf/nutch-default.xml,
but I don't see any corresponding file (regex-urlfilter.txt). Am I just
missing it, or does it need to be created from scratch? If the latter,
what is the format? I'll update the FAQ with the answers.
Thanks,
Jake.