Hi all,
I'm sure this question must have been asked before, but I just can't find a
detailed procedure for doing what I want.
Here is the background of my question. I've crawled a site, lucene.apache.org,
to a depth of 3, using the command bin/nutch crawl urls -dir crawled_lucene -depth 3.
So now I've got a folder crawled_lucene in the filesystem.
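If I remember the layout right, that one-shot crawl left me with roughly this
structure in the DFS (please correct me if the directory names are off):

  crawled_lucene/crawldb
  crawled_lucene/linkdb
  crawled_lucene/segments/   (one segment per depth level)
  crawled_lucene/indexes
  crawled_lucene/index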
Now I want to crawl and index a new URL, say "www.abc.com", into
crawled_lucene. I'm using Nutch with Hadoop, so please check that my
procedure is right for that setup:
1) Update the urls folder in the filesystem (by deleting the urls directory
and re-putting it with ONLY www.abc.com in it; see the sketch after this list).
2) bin/nutch inject crawled_lucene/crawldb urls
3) bin/nutch generate crawled_lucene/crawldb crawled_lucene/segments
4) bin/nutch fetch crawled_lucene/segments/(new segments)
5) bin/nutch updatedb crawled_lucene/crawldb crawled_lucene/segments/(new
segments)
6) Repeat steps 3 to 5 to the depth that I want, e.g. if my desired
depth is 3 then repeat 3 times.
7) bin/nutch invertlinks crawled_lucene/linkdb crawled_lucene/segments/(new segments)
8) bin/nutch index crawled_lucene/indexes crawled_lucene/crawldb
crawled_lucene/linkdb crawled_lucene/segments/(new segments)
9) repeat step 8 for all new segments
10) bin/nutch dedup crawled_lucene/indexes
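
To make step 1 and one generate/fetch/update round concrete, here is roughly
what I plan to run. This is only a sketch: the local seed file urls/seed.txt
and the segment name 20061011123456 are made-up examples, and I'm assuming the
hadoop dfs shell in my version supports -rmr (otherwise I'd use -rm).

  # step 1: replace the seed list in DFS with one that contains only www.abc.com
  echo "http://www.abc.com/" > urls/seed.txt
  bin/hadoop dfs -rmr urls
  bin/hadoop dfs -put urls urls

  # steps 3-5: one round of generate / fetch / updatedb
  bin/nutch generate crawled_lucene/crawldb crawled_lucene/segments
  bin/nutch fetch crawled_lucene/segments/20061011123456
  bin/nutch updatedb crawled_lucene/crawldb crawled_lucene/segments/20061011123456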
Can anyone confirm my procedure? Also, since the new URL www.abc.com is on
a different host, do I need to change the crawl-urlfilter.txt file every
time so that it accepts the new host (from apache.org to abc.com)?
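I'm guessing the change would be to the accept line in conf/crawl-urlfilter.txt,
something like the following (based on the default pattern that ships with
Nutch, so correct me if I've got the regex wrong):

  # old: accept only apache.org hosts
  +^http://([a-z0-9]*\.)*apache.org/
  # new: accept abc.com instead
  +^http://([a-z0-9]*\.)*abc.com/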
THANKS ~
William