Vanderdray, Jake wrote:
I was reading through the FAQ and had a follow-up to one of the
questions on there. Here's what's on the FAQ:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Is it possible to fetch only pages from some specific domains?
Please have a look at PrefixURLFilter. Adding a few regular expressions
to the urlfilter.regex.file might work, but adding a list with thousands
of regular expressions would slow down your system excessively.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
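For instance (an illustrative sketch only; example.com is a placeholder
domain, and the exact pattern may need tuning), a pair of filter lines
that restricts a crawl to a single domain could look like:

```
# accept only URLs under example.com (and its subdomains)
+^http://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
```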
I see the urlfilter.prefix.file entry in conf/nutch-default.xml,
but don't see any corresponding file (regex-urlfilter.txt). Am I just
missing it, or does it need to be created from scratch? If the latter,
what is the format? I'll update the FAQ with the answers.
Thanks,
Jake.
Hi,
You are probably missing it (or mistakenly deleted it), since it is
part of the tar or zip distribution.
Look for conf/regex-urlfilter.txt.
Gal.
Here are the default file contents (so create the file if it doesn't exist):
##############################################
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
############################## End
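The filter semantics described in the file's header (each rule is a
regex prefixed by '+' or '-', the first matching rule decides, and a URL
matching no rule is ignored) can be sketched in a few lines of Python.
This is only an illustration of the matching logic, not Nutch's actual
implementation, and the rules below are a trimmed-down subset of the
default file:

```python
import re

# Each rule is a sign ('+' include, '-' exclude) and a compiled regex.
# Order matters: the first rule whose pattern is found in the URL wins.
RULES = [
    ("-", re.compile(r"^(file|ftp|mailto):")),         # skip non-http schemes
    ("-", re.compile(r"\.(gif|jpg|zip|exe)$", re.I)),  # skip unparsable suffixes
    ("-", re.compile(r"[?*!@=]")),                     # skip probable queries
    ("+", re.compile(r".")),                           # accept anything else
]

def accept(url):
    """Return True if the first matching rule is a '+' rule.

    A URL that matches no rule at all is ignored (returns False),
    mirroring the behavior stated in the file's comments.
    """
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False

print(accept("http://example.com/index.html"))  # True  (falls through to '+.')
print(accept("mailto:someone@example.com"))     # False (mailto: scheme)
print(accept("http://example.com/find?q=x"))    # False ('?' looks like a query)
```

Note the consequence of first-match-wins: a catch-all `+.` rule must come
last, or it would accept every URL before the exclusion rules are reached.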