Vanderdray, Jake wrote:
I was reading through the FAQ and had a follow-up to one of the
questions on there. Here's what's on the FAQ:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Is it possible to fetch only pages from some specific domains?
Please have a look at PrefixURLFilter. Adding a few regular expressions
to the urlfilter.regex.file might work, but adding a list with thousands
of regular expressions would slow down your system excessively.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
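For instance (an illustrative sketch only; example.com is a placeholder
domain, and the exact pattern may need tuning), a pair of filter lines
that restricts a crawl to a single domain could look like:

```
# accept only URLs under example.com (and its subdomains)
+^http://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
```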
I see the urlfilter.prefix.file entry in conf/nutch-default.xml,
but don't see any corresponding file (regex-urlfilter.txt). Am I just
missing it, or does it need to be created from scratch? If the latter,
what is the format? I'll update the FAQ with the answers.
Thanks,
Jake.
Hi,
You are probably missing it (or mistakenly deleted it), since it is
part of the tar or zip distribution.
Look for conf/regex-urlfilter.txt.
Gal.
Here are the default file contents (so create the file if it doesn't exist):
##############################################
# The default url filter.
# Better for whole-internet crawling.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
############################## End
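The filter semantics described in the file's header (each rule is a
regex prefixed by '+' or '-', the first matching rule decides, and a URL
matching no rule is ignored) can be sketched in a few lines of Python.
This is only an illustration of the matching logic, not Nutch's actual
implementation, and the rules below are a trimmed-down subset of the
default file:

```python
import re

# Each rule is a sign ('+' include, '-' exclude) and a compiled regex.
# Order matters: the first rule whose pattern is found in the URL wins.
RULES = [
    ("-", re.compile(r"^(file|ftp|mailto):")),         # skip non-http schemes
    ("-", re.compile(r"\.(gif|jpg|zip|exe)$", re.I)),  # skip unparsable suffixes
    ("-", re.compile(r"[?*!@=]")),                     # skip probable queries
    ("+", re.compile(r".")),                           # accept anything else
]

def accept(url):
    """Return True if the first matching rule is a '+' rule.

    A URL that matches no rule at all is ignored (returns False),
    mirroring the behavior stated in the file's comments.
    """
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False

print(accept("http://example.com/index.html"))  # True  (falls through to '+.')
print(accept("mailto:someone@example.com"))     # False (mailto: scheme)
print(accept("http://example.com/find?q=x"))    # False ('?' looks like a query)
```

Note the consequence of first-match-wins: a catch-all `+.` rule must come
last, or it would accept every URL before the exclusion rules are reached.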