Re: [Nutch-dev] Adding new urls in WebDB

Lourival Júnior Fri, 09 Jun 2006 06:17:36 -0700

Hi Stefan,

Sorry, I couldn't find the mail you referred to :(.


Look at this shell script (I'm using Cygwin on Windows 2000):

#!/bin/bash

# Set JAVA_HOME to reflect your system's Java configuration
export JAVA_HOME=/cygdrive/c/Arquivos\ de\ programas/Java/jre1.5.0

# Start index update
bin/nutch generate crawl-LEGISLA/db crawl-LEGISLA/segments -topN 1000
# Pick the newest segment (names are timestamps, so the newest sorts last)
s=$(ls -d crawl-LEGISLA/segments/2* | tail -1)
echo "Segment is $s"
bin/nutch fetch "$s"
bin/nutch updatedb crawl-LEGISLA/db "$s"
bin/nutch analyze crawl-LEGISLA/db 5
bin/nutch index "$s"
bin/nutch dedup crawl-LEGISLA/segments crawl-LEGISLA/tmpfile

# Merge segments to prevent too many open files exception in Lucene
bin/nutch mergesegs -dir crawl-LEGISLA/segments -i -ds
s=$(ls -d crawl-LEGISLA/segments/2* | tail -1)
echo "Merged Segment is $s"

rm -rf crawl-LEGISLA/index

I found it on the wiki page of the Nutch project. It produces some errors at
run time, and I don't know whether it is correct. Do you have another example
of how to do this job?
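
The segment-selection step of the script above can be tried on its own, without running Nutch. This is only a minimal sketch: `latest_segment` is a hypothetical helper name, the segment names in the demo are made up, and it assumes segment directories are named by timestamp starting with "2", as the `2*` glob in the script implies.

```shell
#!/bin/bash

# Hypothetical helper (not part of Nutch): print the newest segment
# directory under the given segments directory.
# Assumption: segment names are timestamps beginning with "2", so the
# name that sorts last is also the most recent one.
latest_segment() {
  local segments_dir=$1
  ls -d "$segments_dir"/2* 2>/dev/null | tail -1
}

# Demo with made-up segment names in a temporary directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/segments/20060608101500" \
         "$demo_dir/segments/20060609061736"

s=$(latest_segment "$demo_dir/segments")
echo "Segment is $s"
```

If the segments directory is empty, `s` ends up empty too, so a real script would probably want to check for that before calling fetch.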

On 6/9/06, Stefan Neufeind <[EMAIL PROTECTED]> wrote:

Lourival Júnior wrote:
> Hi all!
>
> I have some problems updating my WebDB. I have a page, test.htm, that
> has 4 links to 4 PDF documents. I run the crawler, and then when I run
> this command:
>
> bin/nutch readdb Mydir/db -stats
>
> I get this output:
>
> Number of pages: 5
> Number of links: 4
>
> That's OK. The problem comes when I add 4 more links to test.htm. I
> want a script that re-crawls or updates my WebDB without my having to
> delete the Mydir folder. I hope I am being clear.
> I found some shell scripts to do this, but they don't do what I want.
> I always get the same number of pages and links.
>
> Can anyone help me?

Hi,

Please re-read the mailing-list archives from ... hmm ... yesterday,
I think. You'll have to make a small modification to be able to re-inject
your URL so that it is re-crawled on the next run. Otherwise a page will
only be re-crawled after a configurable number of days, which is the
same value that is also used for the PDFs.

Regards,
Stefan
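
To check whether a re-crawl actually changed the WebDB, the `readdb -stats` output quoted above can be compared before and after a run. The following is a rough sketch only: `page_count` is a hypothetical helper, the "before" text is the output quoted in this thread, and the "after" text is a made-up illustration of what one would hope to see once the 4 new links are picked up. In real use the text would come from `bin/nutch readdb Mydir/db -stats`.

```shell
#!/bin/bash

# Hypothetical helper (not part of Nutch): read stats text on stdin and
# print the number from the first "Number of pages:" line.
page_count() {
  sed -n 's/^[[:space:]]*Number of pages:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' | head -1
}

# "Before" is the output quoted in this thread; "after" is made up.
before=$(printf 'Number of pages: 5\nNumber of links: 4\n' | page_count)
after=$(printf 'Number of pages: 9\nNumber of links: 8\n' | page_count)

if [ "$after" -gt "$before" ]; then
  echo "WebDB grew from $before to $after pages"
else
  echo "WebDB unchanged at $before pages"
fi
```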




--
Lourival Junior
Universidade Federal do Pará
Curso de Bacharelado em Sistemas de Informação
http://www.ufpa.br/cbsi
Msn: [EMAIL PROTECTED]

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
