Marko Bauhardt wrote:


On 22.05.2006 at 13:50, Murat Ali Bayir wrote:

Hi, I have a problem using black/white list URL filtering. I have two directories for filtering,
called NegativeURLS and PositiveURLS.

*****************************************************************************************
in NegativeURLS, I have
www.hurriyet.com.tr

in PositiveURLS, I have www.milliyet.com.tr

*****************************************************************************************
In the input directory for the crawl operation, I have
www.hurriyet.com.tr
www.milliyet.com.tr

I run the following commands from the shell.

$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white

$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black

Then I run inject, generate, and fetch. After that I run the following:
$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/ trace/output/segments/20060522115951/

Finally I run GenericReader and print the output. It contains the URLs that are in the blacklist.
What can be the problem?


The black/white list works only in the update process (BWUpdateDb), not during fetching or generating. Only the white-listed URLs will be updated in the crawldb.
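
To illustrate the idea of filtering at update time only, here is a minimal Python sketch. It is not the BWUpdateDb implementation: the function names (`host_of`, `update_db`), the dictionary-based crawldb, and the assumption that matching is done on the exact host name are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercase host of a URL, tolerating a missing scheme."""
    if "://" not in url:
        url = "http://" + url
    return (urlparse(url).hostname or "").lower()


def update_db(crawldb, fetched, white_hosts, black_hosts):
    """Hypothetical update step: merge fetched URLs into a copy of crawldb,
    skipping any URL whose host is black-listed or not white-listed."""
    updated = dict(crawldb)
    for url in fetched:
        host = host_of(url)
        if host in black_hosts or host not in white_hosts:
            continue  # filtered out: the existing entry is left untouched
        updated[url] = "DB_fetched"
    return updated


crawldb = {
    "http://hurriyet.com.tr/": "DB_unfetched",
    "http://milliyet.com.tr/": "DB_unfetched",
}
result = update_db(
    crawldb,
    fetched=["http://hurriyet.com.tr/", "http://milliyet.com.tr/"],
    white_hosts={"milliyet.com.tr"},
    black_hosts={"hurriyet.com.tr"},
)
```

In this sketch the fetch itself is not restricted. Only the merge into the crawldb is filtered, so a black-listed URL can still appear in the db with its old status.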

Is only www.hurriyet.com.tr in your crawldb, or are there other HTML pages from this host? And what is the status of these URLs (STATUS_DB_FETCHED or STATUS_DB_UNFETCHED)?

Marko



The crawldb contains the following

http://hurriyet.com.tr/ Version: 4
Status: 1 (DB_unfetched)
Fetch time: Mon May 22 19:10:31 EEST 2006
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null

http://milliyet.com.tr/ Version: 4
Status: 1 (DB_unfetched)
Fetch time: Mon May 22 19:10:31 EEST 2006
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null


Both of them are DB_unfetched.

PositiveURL is http://milliyet.com.tr
it is in ~/URL/PositiveURLS/Positive.txt

NegativeURL is http://hurriyet.com.tr
it is in ~/URL/NegativeURLS/Negative.txt

I run the following inject command

./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white

./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black

After the fetch command with the parsing option, I run the following:

$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/ trace/output/segments/20060522115951/


Any suggestions about the two DB_unfetched entries? I expected one of them to be fetched.
