Did you use the bin/nutch crawl script? I tried it once: for an intranet crawl, if you put a domain restriction in crawl-urlfilter.txt, it only fetches pages within that domain.
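For reference, the domain restriction Michael describes would look roughly like this in conf/crawl-urlfilter.txt (a sketch based on the stock file Nutch ships with; MY.DOMAIN.NAME is the placeholder you replace with your own domain):

```
# skip URLs containing characters that are probable queries or session ids
-[?*!@=]
# accept only hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# reject everything else
-.
```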
Michael

--- Vacuum Joe <[EMAIL PROTECTED]> wrote:
> > Hello Joe,
> >
> > If you are using whole web crawling you should
> > change regex-urlfilter.txt instead of
> > crawl-urlfilter.txt.
>
> Hi Piotr,
>
> Thanks for the tip. I tried that. I put:
>
> -^http://af.wikipedia.org/
>
> in both regex-urlfilter.txt and crawl-urlfilter.txt.
> I even put in a bogus entry for af.wikipedia.org in
> my /etc/hosts, and yet when I run a fetch using
>
> nutch fetch segments/244444444
>
> it is still fetching from af.wikipedia.org, and about
> one third of my segment data is in Afrikaans, and of
> no value to me. Is there any other way to do this?
> I'm thinking of putting a rule in the firewall to
> block traffic to that IP address. But surely there's
> some way to tell Fetch "never ever go to this
> server"? That seems like a very important thing to
> have, because a) some servers have undesirable
> content and b) some servers have "spider trap"
> content that will suck in the whole fetch. Any ideas?
>
> Thanks
>
> On 7/28/05, Vacuum Joe <[EMAIL PROTECTED]> wrote:
> > I have a simple question: I'm using Nutch to do
> > some whole-web crawling (just a small dataset).
> > Somehow Nutch has gotten a lot of URLs from
> > af.wikipedia.org into its segments, and when I
> > generate another segment (using -topN 20000) it
> > wants to crawl a bunch more URLs from
> > af.wikipedia.org. I don't want to crawl any of the
> > Afrikaans Wikipedia. Is there a way to block that?
> > Also, I want to block it from ever crawling domains
> > like 33.44.55.66, because those are usually very
> > badly configured servers with worthless content.
> >
> > I tried to put those things into the
> > crawl-urlfilter.txt file and the banned-hosts.txt
> > file, but it seems that the fetch command doesn't
> > pay attention to those two files.
> >
> > Should I be using crawl instead of fetch?
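One gotcha with both filter files: the rules are applied top to bottom and the first matching rule decides, so a `-^http://af.wikipedia.org/` line only takes effect if it appears before any `+` rule that also matches the URL. A minimal Python sketch of that first-match-wins semantics (my own illustration of how the rule list behaves, not Nutch's actual code):

```python
import re

# Each rule is '+' (accept) or '-' (reject) plus a regex, as in
# regex-urlfilter.txt. The first rule whose regex matches the URL
# decides; later rules are never consulted.
RULES = [
    ("-", r"^http://af\.wikipedia\.org/"),  # reject the Afrikaans Wikipedia
    ("-", r"^http://\d+\.\d+\.\d+\.\d+"),   # reject bare-IP hosts like 33.44.55.66
    ("+", r"."),                            # accept everything else
]

def accepts(url):
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is filtered out

print(accepts("http://af.wikipedia.org/wiki/Tuisblad"))  # False
print(accepts("http://33.44.55.66/index.html"))          # False
print(accepts("http://en.wikipedia.org/wiki/Nutch"))     # True
```

If the `+.` accept-everything rule were listed first, every URL would match it immediately and the two `-` rules would never fire, which is one way a reject rule can silently have no effect.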
