On 9/21/06, Jacob Brunson <[EMAIL PROTECTED]> wrote:
> On 9/21/06, Gianni Parini <[EMAIL PROTECTED]> wrote:
>> -Is it possible to have automatic recrawling? Do I have to write
>> my own application? I need an application running in the
>> background that recrawls my intranet site 2-3 times a week.
> On the Nutch wiki you will find an intranet recrawl script. That
> probably will work for you. However, I think the script has a problem
> with duplicating segment data during the mergesegs step, but I've
> asked about it here and haven't had any confirmations.
Well, I can confirm my index grew to ~5 GB from ~1.5 GB after (if I
remember correctly) 2 recrawls.
The script doesn't solve the problem I was after anyway, as it only
indexes pages according to the time of the last crawl, rather than
crawling everything, checking whether the content has a newer
modification/creation date, and indexing only that (typical intranet
scenario). But I'm running like a madman in the opposite direction of
the topic: please ignore me. :)
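
For what it's worth, the check I have in mind could be sketched roughly
like this. It is plain Python, not anything Nutch provides: the function
name and both dictionaries are hypothetical, and I'm assuming a
modification timestamp is available for every fetched page.

```python
from datetime import datetime, timezone


def pages_to_reindex(fetched, indexed):
    """Return the URLs that need (re)indexing, sorted alphabetically.

    fetched: dict mapping URL -> modification time seen in this crawl
    indexed: dict mapping URL -> modification time stored at the last index run
    Both dictionaries are hypothetical stand-ins for real crawl/index data.
    """
    stale = []
    for url, modified in fetched.items():
        previous = indexed.get(url)
        # A page qualifies if it is new or has changed since it was indexed.
        if previous is None or modified > previous:
            stale.append(url)
    return sorted(stale)


if __name__ == "__main__":
    old = datetime(2006, 9, 1, tzinfo=timezone.utc)
    new = datetime(2006, 9, 20, tzinfo=timezone.utc)
    fetched = {"/a.html": new, "/b.html": old, "/c.html": old}
    indexed = {"/a.html": old, "/b.html": old}
    print(pages_to_reindex(fetched, indexed))
```

Everything is still crawled; only the pages returned here would be
passed on to the indexing step.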
t.n.a.