Hi Stefan,

Sorry, I couldn't find the mail that you referred to :(.
Look at this shell script (I'm using Cygwin on Windows 2000):

#!/bin/bash

# Set JAVA_HOME to reflect your system's Java configuration
export JAVA_HOME=/cygdrive/c/Arquivos\ de\ programas/Java/jre1.5.0

# Start index update
bin/nutch generate crawl-LEGISLA/db crawl-LEGISLA/segments -topN 1000
s=`ls -d crawl-LEGISLA/segments/2* | tail -1`
echo Segment is $s
bin/nutch fetch $s
bin/nutch updatedb crawl-LEGISLA/db $s
bin/nutch analyze crawl-LEGISLA/db 5
bin/nutch index $s
bin/nutch dedup crawl-LEGISLA/segments crawl-LEGISLA/tmpfile

# Merge segments to prevent "too many open files" exceptions in Lucene
bin/nutch mergesegs -dir crawl-LEGISLA/segments -i -ds
s=`ls -d crawl-LEGISLA/segments/2* | tail -1`
echo Merged Segment is $s
rm -rf crawl-LEGISLA/index

I found it on the wiki page of the Nutch project. It produces some errors at run time, and I don't know whether it is correct. Do you have another example of how to do this job?

On 6/9/06, Stefan Neufeind <[EMAIL PROTECTED]> wrote:
Lourival Júnior wrote:
> Hi all!
>
> I have some problems updating my WebDB. I have a page, test.htm, that
> has 4 links to 4 PDF documents. I run the crawler, and then when I run
> this command:
>
> bin/nutch readdb Mydir/db -stats
>
> I get this output:
>
> Number of pages: 5
> Number of links: 4
>
> That's OK. The problem is when I add 4 more links to test.htm. I want a
> script that re-crawls or updates my WebDB without my having to delete the
> Mydir folder. I hope I am being clear.
> I found some shell scripts to do this, but they don't do what I want:
> I always get the same number of pages and links.
>
> Can anyone help me?

Hi,

please re-read the mailing-list archives from ... hmm ... yesterday, I think. You'll have to make a small modification to be able to re-inject your URL so that it is re-crawled on the next run. Otherwise a page will only be re-crawled after a configurable number of days, which is the same value that is also used for the PDFs.

Regards,
Stefan
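To help track down where the script earlier in this message fails, here is a minimal sketch of one update cycle that stops at the first failing step and prints each command before running it. It uses only the bin/nutch sub-commands that already appear in that script. The names NUTCH, CRAWL, DRY_RUN, run, latest_segment and update_cycle are my own and purely illustrative, and the sketch assumes, as the `ls -d .../2*` line does, that segment directory names start with "2" and sort by age. It does not cover the re-injection change Stefan mentions.

```shell
#!/bin/bash
# Sketch only: one update cycle with the same bin/nutch sub-commands as the
# script in this message. All variable and function names are illustrative.
NUTCH=${NUTCH:-bin/nutch}
CRAWL=${CRAWL:-crawl-LEGISLA}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = actually run them

# Print a command, and run it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Print the last directory matching <dir>/2* in glob (sorted) order.
# Returns non-zero when no such directory exists.
latest_segment() {
  local newest=""
  local d
  for d in "$1"/2*; do
    [ -d "$d" ] && newest=$d
  done
  [ -n "$newest" ] && echo "$newest"
}

# Run generate, fetch, updatedb and index, stopping at the first failure.
update_cycle() {
  local s
  run "$NUTCH" generate "$CRAWL/db" "$CRAWL/segments" -topN 1000 || return 1
  s=$(latest_segment "$CRAWL/segments") || { echo "no segment found" >&2; return 1; }
  echo "Segment is $s"
  run "$NUTCH" fetch "$s" || return 1
  run "$NUTCH" updatedb "$CRAWL/db" "$s" || return 1
  run "$NUTCH" index "$s" || return 1
}
```

Calling update_cycle with the default DRY_RUN=1 only prints the commands, which makes it easy to check the paths first. Setting DRY_RUN=0 runs them, and the function returns non-zero at the first step that fails. Quoting "$s" and the other paths also avoids word-splitting problems with directory names that contain spaces.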
--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
