Joe,

What are all the rules in your regex-urlfilter.txt file?  When a URL
is checked against the RegexURLFilter, it is tested against each regex
rule in order, and the first rule that matches decides the outcome:
the URL is accepted or rejected depending on the + or - in front of
that rule.

So it's possible that you have a rule which accepts your
af.wikipedia.org URLs before the filter ever reaches the
-^http://af.wikipedia.org/ rule.

For example, if your regex-urlfilter.txt file looked like:

+^http
-^http://af.wikipedia.org/

all af.wikipedia.org URLs would still be accepted, because +^http
matches first and the reject rule is never reached.  The fix is to put
the -^http://af.wikipedia.org/ line before any + rule broad enough to
match those URLs.
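To make the ordering point concrete, here is a small Python sketch of
first-match-wins filtering.  This is not Nutch's actual code, just an
illustration of the semantics described above (the function and
variable names are made up, and the "no rule matched" default is an
assumption):

```python
import re

def filter_url(url, rules):
    """Apply regex rules in order; the first matching rule decides.

    rules: list of (sign, pattern) tuples, where sign is '+' (accept)
    or '-' (reject).  Returns True if accepted, False otherwise.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    return False  # assumed default: a URL matching no rule is rejected

url = "http://af.wikipedia.org/wiki/Tuisblad"

# Rules in the order from the example above: +^http matches first,
# so the -^http://af.wikipedia.org/ rule is never consulted.
bad_order = [('+', r'^http'), ('-', r'^http://af\.wikipedia\.org/')]
print(filter_url(url, bad_order))   # True: the URL slips through

# Put the reject rule first and the URL is filtered out.
good_order = [('-', r'^http://af\.wikipedia\.org/'), ('+', r'^http')]
print(filter_url(url, good_order))  # False: the URL is rejected
```

The same reasoning applies directly to the order of lines in your
regex-urlfilter.txt.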

By the way, the RegexURLFilter class has a main method where you can
feed in URLs via stdin for testing purposes.  This has been very
useful for me in the past.
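For reference, a test run might look something like the following.
The exact class name is an assumption here and may differ between
Nutch versions, so check your own installation:

```shell
# Feed a test URL to the filter on stdin; the filter echoes back its
# accept/reject decision.  The class path below is assumed, not
# verified against your Nutch version.
echo "http://af.wikipedia.org/wiki/Tuisblad" | bin/nutch org.apache.nutch.net.RegexURLFilter
```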

Andy

On 7/31/05, Feng (Michael) Ji <[EMAIL PROTECTED]> wrote:
> did you use the bin/nutch crawl script?
> 
> I tried that once: for an intranet crawl, putting a domain
> restriction in crawl-urlfilter.txt made it fetch only pages within
> that domain.
> 
> Michael,
> 
> --- Vacuum Joe <[EMAIL PROTECTED]> wrote:
> 
> >
> > > Hello Joe,
> > > If you are using whole-web crawling you should change
> > > regex-urlfilter.txt instead of crawl-urlfilter.txt.
> >
> > Hi Piotr,
> >
> > Thanks for the tip.  I tried that.  I put:
> >
> > -^http://af.wikipedia.org/
> >
> > in both regex-urlfilter.txt and crawl-urlfilter.txt.
> >
> > I even put in a bogus entry for af.wikipedia.org in
> > my
> > /etc/hosts, and yet when I run a fetch using
> >
> > nutch fetch segments/244444444
> >
> > it still is fetching from af.wikipedia.org, and about one third of
> > my segment data is in Afrikaans, and of no value to me.  Is there
> > any other way to do this?  I'm thinking of putting a rule in the
> > firewall to block traffic to that IP addr.  But surely there's some
> > way to tell Fetch "never ever go to this server"?  That seems like
> > a very important thing to have, because a) some servers have
> > undesirable content and b) some servers have "spider trap" content
> > that will suck in the whole fetch.  Any ideas?
> >
> > Thanks
> >
> > > On 7/28/05, Vacuum Joe <[EMAIL PROTECTED]> wrote:
> > > > I have a simple question: I'm using Nutch to do some whole-web
> > > > crawling (just a small dataset).  Somehow Nutch has gotten a
> > > > lot of URLs from af.wikipedia.org into its segments, and when I
> > > > generate another segment (using -topN 20000) it wants to crawl
> > > > a bunch more URLs from af.wikipedia.org.  I don't want to crawl
> > > > any of the Afrikaans Wikipedia.  Is there a way to block that?
> > > > Also, I want to block it from ever crawling domains like
> > > > 33.44.55.66, because those are usually very badly configured
> > > > servers with worthless content.
> > > >
> > > > I tried to put those things into the crawl-urlfilter.txt file
> > > > and the banned-hosts.txt file, but it seems that the fetch
> > > > command doesn't pay attention to those two files.
> > > >
> > > > Should I be using crawl instead of fetch?
> > > >
> > > >
> > > >
> > __________________________________________________
> > > > Do You Yahoo!?
> > > > Tired of spam?  Yahoo! Mail has the best spam
> > > protection around
> > > > http://mail.yahoo.com
> > > >
> > >
> >
> >
> >
> 
> 
