Lourival Júnior wrote:
> Hi all!
> 
> I have some problems updating my WebDB. I have a page, test.htm, that
> has 4 links to 4 PDF documents. I run the crawler, and then when I run
> this command:
> 
> bin/nutch readdb Mydir/db -stats
> 
> I get this output:
> 
> Number of pages: 5
> Number of links: 4
> 
> That's OK. The problem is when I add 4 more links to test.htm. I want a
> script that re-crawls or updates my WebDB without me having to delete the
> Mydir folder. I hope I am being clear.
> I found some shell scripts that are supposed to do this, but they don't do
> what I want: I always get the same number of pages and links.
> 
> Can anyone help me?

Hi,

please re-read the mailing list archives from ... hmm ... yesterday,
I think. You'll have to make a small modification to be able to re-inject
your URL so that it is re-crawled on the next run. Otherwise a page will
only be re-crawled after a configurable number of days, and the same
value also applies to the PDFs.
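
For what it's worth, a re-crawl pass with the 0.7-style tools usually looks
roughly like the sketch below. I'm writing the option names from memory
(in particular -urlfile and the db.default.fetch.interval property), so
please double-check them against your installation before relying on them:

  #!/bin/sh
  # Re-crawl sketch for a Nutch WebDB (paths and options are assumptions).
  db=Mydir/db
  segments=Mydir/segments

  # Only if you added brand-new URLs to a seed file, re-inject them first:
  # bin/nutch inject $db -urlfile urls.txt

  # Generate a fetchlist of pages that are due for fetching. Pages only
  # become due again after db.default.fetch.interval days (see
  # nutch-default.xml / nutch-site.xml), unless you lower that value or
  # re-inject the page.
  bin/nutch generate $db $segments

  # Fetch the newest segment and fold the results back into the WebDB.
  segment=`ls -d $segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb $db $segment

  # Check the new page/link counts.
  bin/nutch readdb $db -stats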


Regards,
 Stefan


