I'm running into similar issues with selectively recrawling certain web pages. There are pages that are updated every few minutes that I want to recrawl, and other things those pages link to that I want to avoid. I'm thinking of keeping a per-URL frequency parameter in a DB of URLs, combined with injection, but I haven't quite figured out the best paradigm.
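Roughly what I have in mind, as an untested sketch (the sqlite database, its schema, and the seed directory name are all made up; only "bin/nutch inject" is stock Nutch):

  #!/bin/sh
  # Sketch: pull URLs that are due for a recrawl out of a local sqlite DB
  # and hand them to Nutch's injector.
  # Assumed table: urls(url TEXT, freq_minutes INTEGER, last_crawled INTEGER)
  SEED_DIR=recrawl_seeds
  mkdir -p $SEED_DIR

  # Select every URL whose recrawl interval has elapsed (epoch seconds).
  sqlite3 urls.db "SELECT url FROM urls
    WHERE strftime('%s','now') - last_crawled >= freq_minutes * 60;" \
    > $SEED_DIR/urls.txt

  bin/nutch inject crawldir/crawldb $SEED_DIR

  # Caveat: inject mostly adds *new* URLs; I'm not sure it forces a
  # refetch of URLs already in the crawldb, which is part of what I
  # haven't figured out yet.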
On Thu, Mar 20, 2008 at 1:49 AM, Jean-Christophe Alleman <[EMAIL PROTECTED]> wrote:
>
> Hi everybody!
>
> I'm using nutch-0.9 to crawl and index more than 1,000,000 pages on an
> Intranet. This process takes a lot of time (more than one day), and every
> day some pages are updated, so I would like to know how to re-index those
> pages.
>
> I have downloaded this patch:
> https://issues.apache.org/jira/browse/NUTCH-601, which lets me index
> without deleting my crawl directory.
>
> First I index all the pages I need from my Intranet:
>
> bin/nutch crawl urls -dir crawldir -depth 3 -force (-force comes with
> the patch)
>
> Then I try this to index the pages which have been updated:
>
> bin/nutch crawl maj -dir crawldir -depth 3 -force (maj is the directory
> containing the updated files)
>
> But when I do that, Nutch indexes pages I don't need, like the phpmyadmin
> on the enterprise server, but not the files in the maj directory.
>
> I have also tried launching:
>
> bin/nutch inject crawldir/crawldb maj
>
> and then retrying:
>
> bin/nutch crawl urls -dir crawldir -depth 3 -force
>
> but it does the same as before, or Nutch tells me it has nothing to
> index...
>
> Any ideas? I really need help! I have been trying to solve this problem
> for 2 days but I can't...
>
> Thanks in advance for your help
>
> Jisay
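Jisay, for what it's worth: I believe the "crawl" command only fetches what conf/crawl-urlfilter.txt lets through, and rules are applied first-match-wins, so a "-phpmyadmin" line near the top of that file should keep phpmyadmin out. And instead of a second full crawl, one manual fetch cycle may do what you want. An untested sketch against a stock 0.9 layout (directory names follow your crawldir; "indexes_new" is made up; I believe generate's -adddays makes pages look due before the default 30-day interval):

  # Untested sketch of one manual recrawl cycle.
  bin/nutch generate crawldir/crawldb crawldir/segments -adddays 30

  # Grab the segment that generate just created (the newest one).
  segment=crawldir/segments/`ls -t crawldir/segments | head -1`

  bin/nutch fetch $segment
  bin/nutch updatedb crawldir/crawldb $segment
  bin/nutch invertlinks crawldir/linkdb -dir crawldir/segments

  # Index into a fresh directory; deduping and merging with the old
  # index are left out of this sketch.
  bin/nutch index crawldir/indexes_new crawldir/crawldb crawldir/linkdb \
    crawldir/segments/*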
