On 22.05.2006 at 13:50, Murat Ali Bayir wrote:

Hi, I have a problem using black/white list URL filtering. I have two directories for filtering,
called NegativeURLS and PositiveURLS.

*****************************************************************************************
in NegativeURLS, I have
www.hurriyet.com.tr

in PositiveURLS, I have www.milliyet.com.tr

*****************************************************************************************
In the input directory for the crawl operation, I have
www.hurriyet.com.tr
www.milliyet.com.tr

I run the following commands from the shell.

$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/ PositiveURLS/ -white

$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/ NegativeURLS/ -black

Then I run inject, generate and fetch. After that I run the following:
$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/ trace/output/segments/20060522115951/

Finally I run GenericReader and print the output. It contains the URLs that are in the blacklist.
What could be the problem?

The black/white list is applied only during the update step (BWUpdateDb), not during fetching or generating. Only the whitelisted URLs are written to the crawldb by the update.
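
To make "filtering at update time" a bit more concrete, here is a minimal shell sketch of the idea. It is not how BWUpdateDb is implemented: the function name keep_white_hosts and the plain-text host files are hypothetical, and it assumes that a URL whose host is in neither list is dropped as well, since only white URLs are updated. The real tool works on the bwdb, the crawldb and a segment, not on text files.

```shell
#!/bin/sh
# Hypothetical helper for illustration only; it is not part of Nutch.
# Reads URLs on stdin, one per line.
#   $1 = file with whitelisted hosts, one per line
#   $2 = file with blacklisted hosts, one per line
# Prints only the URLs whose host is whitelisted and not blacklisted
# (assumed semantics: hosts in neither list are dropped too).
keep_white_hosts() {
    white_file=$1
    black_file=$2
    while IFS= read -r url; do
        # Skip empty input lines.
        [ -n "$url" ] || continue
        # Strip an optional scheme, then everything after the host.
        host=${url#*://}
        host=${host%%/*}
        # -x: match the whole line, -F: fixed string, -q: no output.
        if grep -qxF -- "$host" "$white_file" &&
           ! grep -qxF -- "$host" "$black_file"; then
            printf '%s\n' "$url"
        fi
    done
}
```

With www.milliyet.com.tr in the white file and www.hurriyet.com.tr in the black file, piping one URL from each host into keep_white_hosts prints only the www.milliyet.com.tr URL. Nothing in this step touches entries that reached the crawldb by another route.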

Is only www.hurriyet.com.tr in your crawldb, or are there other HTML pages from this host as well? And what is the status of these URLs (STATUS_DB_FETCHED or STATUS_DB_UNFETCHED)?

Marko
