Hi everybody! I'm using Nutch 0.9 to crawl and index more than 1,000,000 pages on an intranet. A full crawl takes a lot of time (more than one day), and every day some pages are updated, so I would like to know how I can re-index just those pages.
I have downloaded the patch from https://issues.apache.org/jira/browse/NUTCH-601, which allows me to index without deleting my crawl directory.

First I index all the pages I need from the intranet:

bin/nutch crawl urls -dir crawldir -depth 3 -force

(the -force flag comes with the patch). Then I try this to index the pages that have been updated, where maj is the directory containing the updated URL lists:

bin/nutch crawl maj -dir crawldir -depth 3 -force

But when I do this, Nutch indexes pages I don't need, like the phpMyAdmin pages on the company server, and not the files listed in the maj directory. I have also tried injecting the updated URLs first:

bin/nutch inject crawldir/crawldb maj

and then re-running:

bin/nutch crawl urls -dir crawldir -depth 3 -force

but it either behaves the same as before, or Nutch tells me it has nothing to index...

Any ideas? I really need help! I have been trying to solve this for two days without success.

Thanks in advance for your help,
Jisay
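In case it helps to see exactly what I am attempting, here is the step-by-step re-crawl sequence I have been experimenting with instead of the one-shot crawl command. This is only a sketch of my setup: the directory names (crawldir, maj) are mine, the -force flag comes from the NUTCH-601 patch, and I am assuming the standard Nutch 0.9 tools (inject, generate, fetch, updatedb, invertlinks, index) chained by hand.

```shell
#!/bin/sh
# Sketch only: selective re-crawl of updated pages with Nutch 0.9.
# Assumes crawldir was created by an earlier full crawl and that maj
# contains plain-text files listing the URLs of the updated pages.

# 1. Inject only the updated URLs into the existing crawl db.
bin/nutch inject crawldir/crawldb maj

# 2. Generate a fetch list from the crawl db into a new segment.
bin/nutch generate crawldir/crawldb crawldir/segments

# 3. Pick the segment that was just generated (the newest one).
segment=`ls -d crawldir/segments/* | tail -1`

# 4. Fetch that segment.
bin/nutch fetch "$segment"

# 5. Fold the fetch results back into the crawl db.
bin/nutch updatedb crawldir/crawldb "$segment"

# 6. Rebuild the link db and index the new segment.
bin/nutch invertlinks crawldir/linkdb "$segment"
bin/nutch index crawldir/indexes crawldir/crawldb crawldir/linkdb "$segment"
```

Is this roughly the right approach for re-indexing updated pages, or is the one-shot crawl command supposed to handle this?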
