I'm running into similar issues with selectively recrawling certain web pages. There are pages that are updated every few minutes that I want to recrawl, and other things those pages link to that I want to avoid. I'm thinking of keeping a per-URL frequency parameter in a DB of URLs, combined with injection, but I haven't quite figured out the best paradigm.
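Roughly what I have in mind, as an untested sketch (the sqlite database, its schema, and the seed directory name are all made up; only "bin/nutch inject" is stock Nutch):

  #!/bin/sh
  # Sketch: pull URLs that are due for a recrawl out of a local sqlite DB
  # and hand them to Nutch's injector.
  # Assumed table: urls(url TEXT, freq_minutes INTEGER, last_crawled INTEGER)
  SEED_DIR=recrawl_seeds
  mkdir -p $SEED_DIR

  # Select every URL whose recrawl interval has elapsed (epoch seconds).
  sqlite3 urls.db "SELECT url FROM urls
    WHERE strftime('%s','now') - last_crawled >= freq_minutes * 60;" \
    > $SEED_DIR/urls.txt

  bin/nutch inject crawldir/crawldb $SEED_DIR

  # Caveat: inject mostly adds *new* URLs; I'm not sure it forces a
  # refetch of URLs already in the crawldb, which is part of what I
  # haven't figured out yet.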
On Thu, Mar 20, 2008 at 1:49 AM, Jean-Christophe Alleman <[EMAIL PROTECTED]> wrote:
>
> Hi everybody!
>
> I'm using nutch-0.9 to crawl and index more than 1,000,000 pages on an
> Intranet. This process takes a lot of time (more than one day), and every
> day some pages are updated, so I would like to know how to re-index those
> pages.
>
> I have downloaded this patch:
> https://issues.apache.org/jira/browse/NUTCH-601, which lets me index
> without deleting my crawl directory.
>
> First I index all the pages I need from my Intranet:
>
> bin/nutch crawl urls -dir crawldir -depth 3 -force (-force comes with
> the patch)
>
> Then I try this to index the pages which have been updated:
>
> bin/nutch crawl maj -dir crawldir -depth 3 -force (maj is the directory
> containing the updated files)
>
> But when I do that, Nutch indexes pages I don't need, like the phpmyadmin
> on the enterprise server, but not the files in the maj directory.
>
> I have also tried launching:
>
> bin/nutch inject crawldir/crawldb maj
>
> and then retrying:
>
> bin/nutch crawl urls -dir crawldir -depth 3 -force
>
> but it does the same as before, or Nutch tells me it has nothing to
> index...
>
> Any ideas? I really need help! I have been trying to solve this problem
> for 2 days but I can't...
>
> Thanks in advance for your help
>
> Jisay
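Jisay, for what it's worth: I believe the "crawl" command only fetches what conf/crawl-urlfilter.txt lets through, and rules are applied first-match-wins, so a "-phpmyadmin" line near the top of that file should keep phpmyadmin out. And instead of a second full crawl, one manual fetch cycle may do what you want. An untested sketch against a stock 0.9 layout (directory names follow your crawldir; "indexes_new" is made up; I believe generate's -adddays makes pages look due before the default 30-day interval):

  # Untested sketch of one manual recrawl cycle.
  bin/nutch generate crawldir/crawldb crawldir/segments -adddays 30

  # Grab the segment that generate just created (the newest one).
  segment=crawldir/segments/`ls -t crawldir/segments | head -1`

  bin/nutch fetch $segment
  bin/nutch updatedb crawldir/crawldb $segment
  bin/nutch invertlinks crawldir/linkdb -dir crawldir/segments

  # Index into a fresh directory; deduping and merging with the old
  # index are left out of this sketch.
  bin/nutch index crawldir/indexes_new crawldir/crawldb crawldir/linkdb \
    crawldir/segments/*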
