This one here: http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04829.html
Regards,
Stefan

Lourival Júnior wrote:
> Hi Stefan,
>
> Sorry, I couldn't find the mail you referred to :(.
>
> Look at this shell script (I'm using Cygwin on Windows 2000):
>
> #!/bin/bash
>
> # Set JAVA_HOME to reflect your system's Java configuration
> export JAVA_HOME=/cygdrive/c/Arquivos\ de\ programas/Java/jre1.5.0
>
> # Start index update
> bin/nutch generate crawl-LEGISLA/db crawl-LEGISLA/segments -topN 1000
> s=`ls -d crawl-LEGISLA/segments/2* | tail -1`
> echo Segment is $s
> bin/nutch fetch $s
> bin/nutch updatedb crawl-LEGISLA/db $s
> bin/nutch analyze crawl-LEGISLA/db 5
> bin/nutch index $s
> bin/nutch dedup crawl-LEGISLA/segments crawl-LEGISLA/tmpfile
>
> # Merge segments to prevent "too many open files" exceptions in Lucene
> bin/nutch mergesegs -dir crawl-LEGISLA/segments -i -ds
> s=`ls -d crawl-LEGISLA/segments/2* | tail -1`
> echo Merged segment is $s
>
> rm -rf crawl-LEGISLA/index
>
> I found it on the Nutch project wiki. It throws some errors at execution
> time, and I don't know whether it is correct... Do you have another example
> of how to do this job?
>
> On 6/9/06, Stefan Neufeind <[EMAIL PROTECTED]> wrote:
>>
>> Lourival Júnior wrote:
>> > Hi all!
>> >
>> > I have some problems updating my WebDB. I have a page, test.htm, that
>> > has 4 links to 4 PDF documents. I run the crawler, and then when I do
>> > this command:
>> >
>> > bin/nutch readdb Mydir/db -stats
>> >
>> > I get this output:
>> >
>> > Number of pages: 5
>> > Number of links: 4
>> >
>> > That's OK. The problem is when I add 4 more links to test.htm. I want a
>> > script that re-crawls or updates my WebDB without my having to delete
>> > the Mydir folder. I hope I am being clear.
>> > I found some shell scripts to do this, but they don't do what I want:
>> > I always get the same number of pages and links.
>> >
>> > Can anyone help me?
>>
>> Hi,
>>
>> please re-read the mailing-list archives as of ... hmm ... yesterday,
>> I think. You'll have to make a small modification to be able to re-inject
>> your URL so that re-crawling starts on the next run. Otherwise a page will
>> only be re-crawled after a configurable number of days, which is the same
>> value also used for the PDFs.
>>
>> Regards,
>> Stefan

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
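If the message linked at the top describes the same trick, the usual shape of it is to re-add your seed URLs to the WebDB before generating the next fetchlist, so the changed test.htm becomes a fetch candidate again. Below is a minimal sketch, assuming Nutch 0.7-era command syntax, the crawl-LEGISLA layout from the script above, and a hypothetical urls.txt seed file; the inject options differ between versions, so run bin/nutch inject without arguments to see the usage for yours. Note also that with a stock install, injecting a URL that is already in the WebDB may not reset its re-fetch time on its own; the "small modification" Stefan mentions in the linked message is what makes that part work, so read that first.

#!/bin/bash
# Sketch only: re-inject the seed URLs, then run one generate/fetch/update
# cycle. The urls.txt file and the -urlfile flag are assumptions here.

NUTCH_DB=crawl-LEGISLA/db          # WebDB used by the recrawl script above
SEGMENTS=crawl-LEGISLA/segments
SEEDS=urls.txt                     # hypothetical file with one URL per line

# Re-add the seeds to the WebDB; pages already known are kept, and new
# outlinks of re-fetched pages are added by the next updatedb.
bin/nutch inject $NUTCH_DB -urlfile $SEEDS

# Then one normal cycle, exactly as in the script above.
bin/nutch generate $NUTCH_DB $SEGMENTS -topN 1000
s=`ls -d $SEGMENTS/2* | tail -1`
bin/nutch fetch $s
bin/nutch updatedb $NUTCH_DB $s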

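As for the "configurable number of days": that is the db.default.fetch.interval property (30 days by default in the 0.7-era nutch-default.xml, overridable in conf/nutch-site.xml), and if memory serves the fetchlist generator also accepts an -adddays option that pretends the clock is N days ahead, so pages whose interval has not yet expired still land on the fetchlist. A sketch, assuming your bin/nutch generate supports the flag (check its usage output first):

# Treat everything as if 31 extra days had passed, so pages on the default
# 30-day interval become due for re-fetching on this run.
bin/nutch generate crawl-LEGISLA/db crawl-LEGISLA/segments -topN 1000 -adddays 31

Only the fetchlist selection changes; the rest of the cycle (fetch, updatedb, analyze, index, dedup) stays as in the script above.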