Gianni- Here's the recrawl script that Jacob mentioned: http://wiki.apache.org/nutch/IntranetRecrawl [Note: There are 0.7.x and 0.8 versions]
Jacob- I noticed that the 0.8 script had an issue after merging too. After it merges the segments, it fails to remove all the segments that it used to create the merged segment. (I think that's why there are all these comments about it filling up your disk, and recommending that you rm your segments and perform a periodic recrawl from scratch...)

I changed this line after the mergesegs:

    for segment in `ls -d $segments_dir/* | tail -$depth`

to:

    for segment in `ls -d $segments_dir/*`

[Note: There is no need for that for loop if you don't care to print out the segments you are removing; you can just make it 'rm -rf $segments_dir/*' instead.]

Am I missing something? It looks like the mergesegs call is on all the segments, so it seems right to nuke the segments folder contents before moving in the resulting merged segment.

Jared-

-----Original Message-----
From: Jacob Brunson [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 21, 2006 12:57 PM
To: [email protected]
Subject: Re: Automatic crawling

On 9/21/06, Gianni Parini <[EMAIL PROTECTED]> wrote:
> -Is it possible to have an automatic recrawling? Have I got to write
> my own application by myself? I need an application running in
> background that re-crawls my intranet site 2-3 times a week..

On the nutch wiki you will find an intranet recrawl script. That probably will work for you. However, I think the script has a problem with duplicating segment data during the mergesegs step, but I've asked about it here and haven't had any confirmations.
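For reference, the cleanup described in the reply above (remove every old segment once mergesegs has run, then move the merged segment in) could be sketched roughly as follows. This is only a sketch under assumptions: the function name replace_segments and the merged_dir argument are hypothetical and do not come from the wiki script, and it assumes the merge output was written to a directory outside $segments_dir. It also refuses to delete anything if the merge produced no output.

```shell
#!/bin/sh
# Sketch only: replace_segments and merged_dir are hypothetical names,
# not taken from the wiki recrawl script.
# Assumes the mergesegs step already wrote its output under $merged_dir,
# which must live outside $segments_dir.
replace_segments() {
    segments_dir=$1
    merged_dir=$2

    # Guard against an empty or missing path so we never expand to "/*".
    if [ -z "$segments_dir" ] || [ ! -d "$segments_dir" ]; then
        echo "segments directory not found: $segments_dir" >&2
        return 1
    fi

    # Refuse to delete anything unless the merge actually produced output.
    if [ -z "$(ls -A "$merged_dir" 2>/dev/null)" ]; then
        echo "no merged segment in $merged_dir; keeping old segments" >&2
        return 1
    fi

    # Remove every old segment, not only the last $depth of them.
    for segment in "$segments_dir"/*; do
        [ -e "$segment" ] || continue
        echo "removing $segment"
        rm -rf "$segment"
    done

    # Move the merged segment(s) into the now-empty segments directory.
    mv "$merged_dir"/* "$segments_dir"/
    rmdir "$merged_dir"
}
```

It would be called right after the merge step, e.g. replace_segments "$segments_dir" "$merged_dir". The loop is kept only so the removed segments get printed; as noted above, a plain rm -rf on the directory contents does the same job.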
