This one here: http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04829.html
Regards,
Stefan

Lourival Júnior wrote:
> Hi Stefan,
>
> Sorry, I couldn't find the mail you referred to :(.
>
> Look at this shell script (I'm using Cygwin on Windows 2000):
>
> #!/bin/bash
>
> # Set JAVA_HOME to reflect your system's Java configuration
> export JAVA_HOME=/cygdrive/c/Arquivos\ de\ programas/Java/jre1.5.0
>
> # Start index update
> bin/nutch generate crawl-LEGISLA/db crawl-LEGISLA/segments -topN 1000
> s=`ls -d crawl-LEGISLA/segments/2* | tail -1`
> echo Segment is $s
> bin/nutch fetch $s
> bin/nutch updatedb crawl-LEGISLA/db $s
> bin/nutch analyze crawl-LEGISLA/db 5
> bin/nutch index $s
> bin/nutch dedup crawl-LEGISLA/segments crawl-LEGISLA/tmpfile
>
> # Merge segments to prevent "too many open files" exceptions in Lucene
> bin/nutch mergesegs -dir crawl-LEGISLA/segments -i -ds
> s=`ls -d crawl-LEGISLA/segments/2* | tail -1`
> echo Merged segment is $s
>
> rm -rf crawl-LEGISLA/index
>
> I found it on the Nutch project wiki. It throws some errors at execution
> time, and I don't know whether it is correct... Do you have another example
> of how to do this job?
>
> On 6/9/06, Stefan Neufeind <[EMAIL PROTECTED]> wrote:
>>
>> Lourival Júnior wrote:
>> > Hi all!
>> >
>> > I have some problems updating my WebDB. I have a page, test.htm, that
>> > has 4 links to 4 PDF documents. I run the crawler, and then when I do
>> > this command:
>> >
>> > bin/nutch readdb Mydir/db -stats
>> >
>> > I get this output:
>> >
>> > Number of pages: 5
>> > Number of links: 4
>> >
>> > That's OK. The problem is when I add 4 more links to test.htm. I want a
>> > script that re-crawls or updates my WebDB without my having to delete
>> > the Mydir folder. I hope I am being clear.
>> > I found some shell scripts to do this, but they don't do what I want:
>> > I always get the same number of pages and links.
>> >
>> > Can anyone help me?
>>
>> Hi,
>>
>> please re-read the mailing-list archives as of ... hmm ... yesterday,
>> I think. You'll have to make a small modification to be able to re-inject
>> your URL so that re-crawling starts on the next run. Otherwise a page will
>> only be re-crawled after a configurable number of days, which is the same
>> value also used for the PDFs.
>>
>> Regards,
>> Stefan

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
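If the message linked at the top describes the same trick, the usual shape of it is to re-add your seed URLs to the WebDB before generating the next fetchlist, so the changed test.htm becomes a fetch candidate again. Below is a minimal sketch, assuming Nutch 0.7-era command syntax, the crawl-LEGISLA layout from the script above, and a hypothetical urls.txt seed file; the inject options differ between versions, so run bin/nutch inject without arguments to see the usage for yours. Note also that with a stock install, injecting a URL that is already in the WebDB may not reset its re-fetch time on its own; the "small modification" Stefan mentions in the linked message is what makes that part work, so read that first.

#!/bin/bash
# Sketch only: re-inject the seed URLs, then run one generate/fetch/update
# cycle. The urls.txt file and the -urlfile flag are assumptions here.

NUTCH_DB=crawl-LEGISLA/db          # WebDB used by the recrawl script above
SEGMENTS=crawl-LEGISLA/segments
SEEDS=urls.txt                     # hypothetical file with one URL per line

# Re-add the seeds to the WebDB; pages already known are kept, and new
# outlinks of re-fetched pages are added by the next updatedb.
bin/nutch inject $NUTCH_DB -urlfile $SEEDS

# Then one normal cycle, exactly as in the script above.
bin/nutch generate $NUTCH_DB $SEGMENTS -topN 1000
s=`ls -d $SEGMENTS/2* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb $NUTCH_DB $s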

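As for the "configurable number of days": that is the db.default.fetch.interval property (30 days by default in the 0.7-era nutch-default.xml, overridable in conf/nutch-site.xml), and if memory serves the fetchlist generator also accepts an -adddays option that pretends the clock is N days ahead, so pages whose interval has not yet expired still land on the fetchlist. A sketch, assuming your bin/nutch generate supports the flag (check its usage output first):

# Treat everything as if 31 extra days had passed, so pages on the default
# 30-day interval become due for re-fetching on this run.
bin/nutch generate crawl-LEGISLA/db crawl-LEGISLA/segments -topN 1000 -adddays 31

Only the fetchlist selection changes; the rest of the cycle (fetch, updatedb, analyze, index, dedup) stays as in the script above.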