Gianni-
Here's the recrawl script that Jacob mentioned:
http://wiki.apache.org/nutch/IntranetRecrawl
[Note: There are 0.7.x and 0.8 versions]

Jacob-
I noticed that the 0.8 script had an issue after merging too.
After it merges the segments, it fails to remove all the segments that
it used to create the merged segment.  (I think that's why there are all
these comments about it filling up your disk, and recommending that you
rm your segments and perform a periodic recrawl from scratch...)

I changed this line after the mergesegs:
for segment in `ls -d $segments_dir/* | tail -$depth`

to:
for segment in `ls -d $segments_dir/*`

[Note: If you don't care to print out the segments you are removing,
you don't need the for loop at all; you can just make it 'rm -rf
$segments_dir/*']

Am I missing something? It looks like the mergesegs call is on all the
segments so it seems right to nuke the segments folder contents before
moving in the resulting merged segment.
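
A minimal sketch of that cleanup, to make the idea concrete. This is not
the wiki script itself: the function name, the merged_dir argument, and
the empty-output guard are my own assumptions, and the demo at the bottom
uses made-up segment names in a temp directory instead of calling nutch.

```shell
#!/bin/sh
# Sketch only: after mergesegs has run, remove every old segment and
# move the merged segment in. "merged_dir" is a hypothetical name for
# wherever the mergesegs output was written.
replace_with_merged() {
    segments_dir=$1
    merged_dir=$2

    # Guard (an addition, not from the wiki script): do not delete
    # anything unless the merge actually produced output.
    if [ ! -d "$merged_dir" ] || [ -z "$(ls -A "$merged_dir")" ]; then
        echo "no merged segment in $merged_dir; leaving $segments_dir alone" >&2
        return 1
    fi

    # Same loop as in the mail, but over ALL segments (no "tail -$depth").
    for segment in "$segments_dir"/*; do
        [ -e "$segment" ] || continue
        echo "Removing segment: $segment"
        rm -rf "$segment"
    done

    # Move the merged segment into the now-empty segments directory.
    mv "$merged_dir"/* "$segments_dir"/
    rmdir "$merged_dir"
}

# Demo with fake segment directories (names are illustrative only).
demo=$(mktemp -d)
mkdir -p "$demo/segments/20060919" "$demo/segments/20060920" \
         "$demo/segments/20060921" "$demo/merged/20060921-merged"
replace_with_merged "$demo/segments" "$demo/merged"
ls "$demo/segments"
```

The guard is there because removing all segments is only safe when the
merge succeeded; without it, a failed mergesegs would leave you with an
empty segments folder.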

Jared-

-----Original Message-----
From: Jacob Brunson [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 21, 2006 12:57 PM
To: [email protected]
Subject: Re: Automatic crawling

On 9/21/06, Gianni Parini <[EMAIL PROTECTED]> wrote:
> -Is it possible to have an automatic recrawling? have i got to write
> my own application by myself? I need an application running in
> background that re-crawl my intranet site 2-3 times a week..

On the nutch wiki you will find an intranet recrawl script.  That
probably will work for you.  However, I think the script has a problem
with duplicating segment data during the mergesegs step, but I've
asked about it here and haven't had any confirmations.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
