Hi all,
I'm sure this question must have been asked before, but I just can't find a
detailed procedure for doing what I want.
Here is the background of my question. I've crawled a site, lucene.apache.org,
to a depth of 3, using the command bin/nutch crawl urls -dir crawled_lucene -depth 3.
So now I've got a folder crawled_lucene in the filesystem.
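If I remember the layout right, that one-shot crawl left me with roughly this
structure in the DFS (please correct me if the directory names are off):

  crawled_lucene/crawldb
  crawled_lucene/linkdb
  crawled_lucene/segments/   (one segment per depth level)
  crawled_lucene/indexes
  crawled_lucene/index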
Now I want to crawl and index a new URL, say "www.abc.com", into
crawled_lucene. I'm using Nutch with Hadoop, so please check that my
procedure is right for that setup:
1) Update the urls folder in the filesystem (by deleting the urls directory
and re-putting it with ONLY www.abc.com in it; see the sketch after this list).
2) bin/nutch inject crawled_lucene/crawldb urls
3) bin/nutch generate crawled_lucene/crawldb crawled_lucene/segments
4) bin/nutch fetch crawled_lucene/segments/(new segments)
5) bin/nutch updatedb crawled_lucene/crawldb crawled_lucene/segments/(new
segments)
6) Repeat steps 3 to 5 to the depth that I want, e.g. if my desired
depth is 3 then repeat 3 times.
7) bin/nutch invertlinks crawled_lucene/linkdb crawled_lucene/segments/(new segments)
8) bin/nutch index crawled_lucene/indexes crawled_lucene/crawldb
crawled_lucene/linkdb crawled_lucene/segments/(new segments)
9) repeat step 8 for all new segments
10) bin/nutch dedup crawled_lucene/indexes
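
To make step 1 and one generate/fetch/update round concrete, here is roughly
what I plan to run. This is only a sketch: the local seed file urls/seed.txt
and the segment name 20061011123456 are made-up examples, and I'm assuming the
hadoop dfs shell in my version supports -rmr (otherwise I'd use -rm).

  # step 1: replace the seed list in DFS with one that contains only www.abc.com
  echo "http://www.abc.com/" > urls/seed.txt
  bin/hadoop dfs -rmr urls
  bin/hadoop dfs -put urls urls

  # steps 3-5: one round of generate / fetch / updatedb
  bin/nutch generate crawled_lucene/crawldb crawled_lucene/segments
  bin/nutch fetch crawled_lucene/segments/20061011123456
  bin/nutch updatedb crawled_lucene/crawldb crawled_lucene/segments/20061011123456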
Can anyone confirm my procedure? Also, since the new URL www.abc.com is on
a different host, do I need to change the crawl-urlfilter.txt file every
time so that it accepts the new host (from apache.org to abc.com)?
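I'm guessing the change would be to the accept line in conf/crawl-urlfilter.txt,
something like the following (based on the default pattern that ships with
Nutch, so correct me if I've got the regex wrong):

  # old: accept only apache.org hosts
  +^http://([a-z0-9]*\.)*apache.org/
  # new: accept abc.com instead
  +^http://([a-z0-9]*\.)*abc.com/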
THANKS ~
William