Marko Bauhardt wrote:
On 22.05.2006 at 13:50, Murat Ali Bayir wrote:
Hi, I have a problem using black/white list URL filtering. I
have two directories for filtering,
called NegativeURLS and PositiveURLS.
**********************************************************************
In NegativeURLS, I have
www.hurriyet.com.tr
In PositiveURLS, I have
www.milliyet.com.tr
**********************************************************************
In the input directory for the crawl operation, I have
www.hurriyet.com.tr
www.milliyet.com.tr
I run the following commands from the shell:
$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
$ ./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black
Then I run inject, generate and fetch. After that I run the following:
$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/
Finally, I run GenericReader and print the output; it contains the
URLs that are in the blacklist.
What could be the problem?
The black/white list is applied only during the update step (BWUpdateDb),
not during fetching or generating. Only the whitelisted URLs will be
updated into the crawldb.
Is only www.hurriyet.com.tr in your crawldb, or also other HTML pages from
this host? And what is the status of these URLs (STATUS_DB_FETCHED or
STATUS_DB_UNFETCHED)?
Marko
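The point that filtering happens only when fetch results are merged back can be illustrated with a small shell sketch. This is not the BWUpdateDb code: the function name bw_filter, the two list files, and the exact-host matching rule are all assumptions made for illustration.

```shell
# Illustrative only: NOT the BWUpdateDb implementation.
# Reads candidate URLs on stdin and prints those whose host appears in the
# white list file ($1) and does not appear in the black list file ($2).
# Assumption: each list file holds one host per line, matched exactly.
bw_filter() {
  white_file=$1
  black_file=$2
  while IFS= read -r url; do
    host=${url#*://}   # drop the scheme, if present
    host=${host%%/*}   # drop everything after the host
    if grep -qxF -- "$host" "$white_file" &&
       ! grep -qxF -- "$host" "$black_file"; then
      printf '%s\n' "$url"
    fi
  done
}
```

Under these assumptions, piping a list of URLs through bw_filter white.txt black.txt (hypothetical file names) keeps only the whitelisted hosts. URLs that enter the crawldb by another route, such as a plain inject, never pass through such a filter.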
The crawldb contains the following
http://hurriyet.com.tr/ Version: 4
Status: 1 (DB_unfetched)
Fetch time: Mon May 22 19:10:31 EEST 2006
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
http://milliyet.com.tr/ Version: 4
Status: 1 (DB_unfetched)
Fetch time: Mon May 22 19:10:31 EEST 2006
Modified time: Thu Jan 01 02:00:00 EET 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Both of them are DB_unfetched.
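For larger dumps, the statuses can be tallied rather than read by eye. The sketch below assumes the dump is plain text with one "Status: N (NAME)" line per record, as in the listing above; the function name count_status is hypothetical.

```shell
# Count crawldb entries per status from a text dump read on stdin.
# Assumes each record contains a line like "Status: 1 (DB_unfetched)".
count_status() {
  awk '/^[[:space:]]*Status:/ {
         s = $3                 # third field, e.g. "(DB_unfetched)"
         gsub(/[()]/, "", s)    # strip the parentheses
         counts[s]++
       }
       END { for (s in counts) print s, counts[s] }' | sort
}

# Usage with an excerpt of the dump shown above:
count_status <<'EOF'
http://hurriyet.com.tr/ Version: 4
Status: 1 (DB_unfetched)
http://milliyet.com.tr/ Version: 4
Status: 1 (DB_unfetched)
EOF
# prints: DB_unfetched 2
```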
The positive URL is http://milliyet.com.tr;
it is in ~/URL/PositiveURLS/Positive.txt.
The negative URL is http://hurriyet.com.tr;
it is in ~/URL/NegativeURLS/Negative.txt.
I run the following inject commands:
./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/PositiveURLS/ -white
./nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/URL/NegativeURLS/ -black
After the fetch command with the parsing option,
I run the following:
$ ./nutch org.apache.nutch.crawl.bw.BWUpdateDb <crawldb> bwdb ~/trace/output/segments/20060522115951/
Any suggestions about the two DB_unfetched entries? I expected one of them to be fetched.