Joe,

What are all the rules in your regex-urlfilter.txt file? When a URL is checked against the RegexURLFilter, it is tested against each regex rule in order, and the first rule that matches decides the outcome: the URL is accepted or rejected depending on the + or - in front of that matching rule.
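For readers following along, the first-match behavior described above can be sketched in a few lines of Python. This is only an illustration of the matching logic, not Nutch's actual Java implementation:

```python
import re

def filter_url(url, rules):
    """Apply accept (+) / reject (-) regex rules in order; the first
    rule whose pattern matches the URL decides accept or reject."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: URL is dropped

# Rule order matters: here the broad accept shadows the deny rule.
bad_order = [("+", r"^http"), ("-", r"^http://af\.wikipedia\.org/")]
good_order = [("-", r"^http://af\.wikipedia\.org/"), ("+", r"^http")]

url = "http://af.wikipedia.org/wiki/Tuisblad"
print(filter_url(url, bad_order))   # True  -- still accepted
print(filter_url(url, good_order))  # False -- rejected as intended
```

With the deny rule listed first, the same URL is rejected before the catch-all accept is ever consulted.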
So it may be that you have a rule which accepts your af.wikipedia.org URLs before the -^http://af.wikipedia.org/ rule is even reached. For example, if your regex-urlfilter.txt file looked like:

+^http
-^http://af.wikipedia.org/

all af.wikipedia.org URLs would still be accepted, because the first rule matches before the deny rule is consulted.

By the way, the RegexURLFilter class has a main method to which you can feed URLs via stdin for testing purposes. This has been very useful for me in the past.

Andy

On 7/31/05, Feng (Michael) Ji <[EMAIL PROTECTED]> wrote:
> Did you use the script call of bin/nutch crawl...?
>
> I tried that once: if I want an intranet crawl and put a restricted domain
> in crawl-urlfilter.txt, it only fetches pages within that domain.
>
> Michael
>
> --- Vacuum Joe <[EMAIL PROTECTED]> wrote:
> >
> > > Hello Joe,
> > > If you are using whole-web crawling you should change
> > > regex-urlfilter.txt instead of crawl-urlfilter.txt.
> >
> > Hi Piotr,
> >
> > Thanks for the tip. I tried that. I put:
> >
> > -^http://af.wikipedia.org/
> >
> > in both regex-urlfilter.txt and crawl-urlfilter.txt.
> >
> > I even put a bogus entry for af.wikipedia.org in my /etc/hosts, and yet
> > when I run a fetch using
> >
> > nutch fetch segments/244444444
> >
> > it is still fetching from af.wikipedia.org, and about one third of my
> > segment data is in Afrikaans, of no value to me. Is there any other way
> > to do this? I'm thinking of putting a rule in the firewall to block
> > traffic to that IP address. But surely there's some way to tell Fetch
> > "never ever go to this server"? That seems like a very important thing
> > to have, because a) some servers have undesirable content and b) some
> > servers have "spider trap" content that will suck in the whole fetch.
> > Any ideas?
> >
> > Thanks
> >
> > > On 7/28/05, Vacuum Joe <[EMAIL PROTECTED]> wrote:
> > > > I have a simple question: I'm using Nutch to do some whole-web
> > > > crawling (just a small dataset).
> > > > Somehow Nutch has gotten a lot of URLs from af.wikipedia.org into
> > > > its segments, and when I generate another segment (using -topN
> > > > 20000) it wants to crawl a bunch more URLs from af.wikipedia.org.
> > > > I don't want to crawl any of the Afrikaans Wikipedia. Is there a
> > > > way to block that? Also, I want to block it from ever crawling
> > > > domains like 33.44.55.66, because those are usually very badly
> > > > configured servers with worthless content.
> > > >
> > > > I tried to put those things into the crawl-urlfilter.txt file and
> > > > the banned-hosts.txt file, but it seems that the fetch command
> > > > doesn't pay attention to those two files.
> > > >
> > > > Should I be using crawl instead of fetch?
> > > >
> > > > __________________________________________________
> > > > Do You Yahoo!?
> > > > Tired of spam? Yahoo! Mail has the best spam protection around
> > > > http://mail.yahoo.com
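Putting the thread's advice together: since rules are evaluated top-down and the first match wins, the deny rules have to appear before any broad accept rule. A regex-urlfilter.txt along these lines should block both the Afrikaans Wikipedia and raw-IP hosts (this assumes your file ends with the catch-all accept rule +. found in stock Nutch configs; the escaped dots are a correctness nicety, since each rule is a regex):

```
# reject the Afrikaans Wikipedia before anything can accept it
-^http://af\.wikipedia\.org/
# reject raw-IP hosts such as http://33.44.55.66/
-^http://[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+
# accept everything else
+.
```

You can sanity-check a file like this with the RegexURLFilter main method Andy mentions, piping candidate URLs in on stdin before running a real fetch.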
