Hi,

On 5/28/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> In my crawl-urlfilter.txt I have put a statement like
>
> -^http://cdserver
>
> Still while running crawl, it fetches this site. I am running the
> crawl using these commands:-
>
> bin/nutch inject crawl/crawldb urls
>
> Inside a loop:-
>
> bin/nutch generate crawl/crawldb crawl/segments -topN 10
> segment=`ls -d crawl/segments/* | tail -1`
> bin/nutch fetch $segment -threads 10
> bin/nutch updatedb crawl/crawldb $segment
>
> Why does it fetch http://cdserver even though I have blocked it? Is it
> becoming "allowed" from some other filter file? If so, what do I need
> to check. Please help.
In your case, crawl-urlfilter.txt is not read because you are not running the 'crawl' command (as in bin/nutch crawl). When you run the individual commands (inject, generate, fetch, updatedb), you have to put your patterns in regex-urlfilter.txt or prefix-urlfilter.txt instead, and make sure the corresponding urlfilter plugins are enabled in your conf.

--
Doğacan Güney
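For example, a minimal sketch of what that looks like (assuming a standard conf layout): add the same pattern to conf/regex-urlfilter.txt, keeping the exclusion above the catch-all accept rule, since rules are applied top to bottom and the first match wins:

    # conf/regex-urlfilter.txt
    # skip the CD server
    -^http://cdserver

    # accept everything else (default catch-all; keep it last)
    +.

Then check that urlfilter-regex appears in the plugin.includes property in conf/nutch-site.xml. The value below is only an illustration; your actual plugin list will likely differ, and the key point is just that urlfilter-regex is included:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>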
