Thanks. I just wasn't seeing it for some reason. So what's the difference between adding entries in regex-urlfilter.txt and adding them to crawl-urlfilter.txt?
Thanks,
Jake.

-----Original Message-----
From: Gal Nitzan [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 20, 2005 3:27 PM
To: [email protected]
Subject: Re: Fetching FAQ

Vanderdray, Jake wrote:
> I was reading through the FAQ and had a follow-up to one of the
> questions on there. Here's what's on the FAQ:
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Is it possible to fetch only pages from some specific domains?
>
> Please have a look at PrefixURLFilter. Adding some regular expressions
> to the urlfilter.regex.file might work, but adding a list with thousands
> of regular expressions would slow down your system excessively.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> I see the urlfilter.prefix.file entry in conf/nutch-default.xml,
> but don't see any corresponding file (regex-urlfilter.txt). Am I just
> missing it, or does it need to be created from scratch? If the latter,
> what is the format? I'll update the FAQ with the answers.
>
> Thanks,
> Jake.
>
> .
>

Hi,

You are probably missing it (or mistakenly deleted it), since it is part of the tar or zip file. Look for conf/regex-urlfilter.txt

Gal.

Here are the default file contents (so create it if it doesn't exist):

##############################################
# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept anything else
+.

############################## End
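
For anyone unsure how the rules in that file are evaluated, here is a rough sketch in Python of the first-match behaviour the file's own comments describe. It is not Nutch's code: the names parse_rules, accepts, DEFAULT_RULES and DOMAIN_RULES are made up for illustration, the unanchored "search" matching is an assumption (it is consistent with the "+." catch-all rule), and Python's regex dialect may differ from the one Nutch uses. The rule whose pattern the list archive obscured is left out rather than guessed at, and the example.com rule is a hypothetical way to restrict fetching to one domain.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        # True means "include", False means "ignore".
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first matching rule; no match means ignore."""
    for include, regex in rules:
        # Assumption: a pattern may match anywhere in the URL unless anchored.
        if regex.search(url):
            return include
    return False

# Rules copied from the default file above, minus the obscured one.
DEFAULT_RULES = parse_rules(r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
# accept anything else
+.
""")

# Hypothetical rule set that only lets one domain through: with no
# catch-all '+.' line, every URL that matches nothing is ignored.
DOMAIN_RULES = parse_rules(r"""
+^http://([a-z0-9]*\.)*example\.com/
""")

if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "ftp://example.com/file",
                "http://example.com/logo.gif"):
        print(url, accepts(DEFAULT_RULES, url))
```

Because the first match wins, the order of the lines matters: a '-' rule only has an effect if it appears before a broader '+' rule that would also match.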
