Hi everybody! I'm using Nutch 0.9 to crawl and index more than 1,000,000 pages on an intranet. A full crawl takes a lot of time (more than one day), and every day some pages are updated, so I would like to know how I can re-index just those pages.
I have downloaded the patch from https://issues.apache.org/jira/browse/NUTCH-601, which allows me to index without deleting my crawl directory.

First I index all the pages I need from the intranet:

bin/nutch crawl urls -dir crawldir -depth 3 -force

(the -force flag comes with the patch). Then I try this to index the pages that have been updated, where maj is the directory containing the updated URL lists:

bin/nutch crawl maj -dir crawldir -depth 3 -force

But when I do this, Nutch indexes pages I don't need, like the phpMyAdmin pages on the company server, and not the files listed in the maj directory. I have also tried injecting the updated URLs first:

bin/nutch inject crawldir/crawldb maj

and then re-running:

bin/nutch crawl urls -dir crawldir -depth 3 -force

but it either behaves the same as before, or Nutch tells me it has nothing to index...

Any ideas? I really need help! I have been trying to solve this for two days without success.

Thanks in advance for your help,
Jisay
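In case it helps to see exactly what I am attempting, here is the step-by-step re-crawl sequence I have been experimenting with instead of the one-shot crawl command. This is only a sketch of my setup: the directory names (crawldir, maj) are mine, the -force flag comes from the NUTCH-601 patch, and I am assuming the standard Nutch 0.9 tools (inject, generate, fetch, updatedb, invertlinks, index) chained by hand.

```shell
#!/bin/sh
# Sketch only: selective re-crawl of updated pages with Nutch 0.9.
# Assumes crawldir was created by an earlier full crawl and that maj
# contains plain-text files listing the URLs of the updated pages.

# 1. Inject only the updated URLs into the existing crawl db.
bin/nutch inject crawldir/crawldb maj

# 2. Generate a fetch list from the crawl db into a new segment.
bin/nutch generate crawldir/crawldb crawldir/segments

# 3. Pick the segment that was just generated (the newest one).
segment=`ls -d crawldir/segments/* | tail -1`

# 4. Fetch that segment.
bin/nutch fetch "$segment"

# 5. Fold the fetch results back into the crawl db.
bin/nutch updatedb crawldir/crawldb "$segment"

# 6. Rebuild the link db and index the new segment.
bin/nutch invertlinks crawldir/linkdb "$segment"
bin/nutch index crawldir/indexes crawldir/crawldb crawldir/linkdb "$segment"
```

Is this roughly the right approach for re-indexing updated pages, or is the one-shot crawl command supposed to handle this?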
