Ben Ogle wrote:

Hi all, our organization is using nutch on a documentation intranet that
changes every now and then. To keep the index up to date, we are recrawling
the whole thing every night. For an intranet this seems to be a workaround
at best. Our nutch crawler is on the same server as our content and a
simpler solution, IMO, would be to monitor file system events and just
recrawl the necessary pages each time something changes. That way our index
would always be up to date and there would be no reason to do a brute force
recrawl every night. I am willing to write this functionality and contribute
it to the community as I believe other organizations could benefit from this
as well, but since I am not as familiar with nutch as some of the folks
here, I have a few questions.

- Is this a solution to a nonexistent problem?


I don't think there is any standardized way to do this yet, so every step in this
direction would be a great improvement.

I mean, is there a nice
solution using the tools already provided?


Not that I am aware of, but I guess other people have tackled this as well.

I think it would be nice to generate an RSS feed or something similar as a fetch list,
which could also be accessed by other crawlers.

I know each page is time stamped
in the database when it is fetched, but does this correspond to the last
modified date?

I am still not sure whether Nutch actually compares the last-modified dates. I know there is
an option called "adddays", but that is more for postponing re-crawling, e.g. by 30 days.

- Could this be done by using the existing generate/fetch/update cycle with
an index update? Is there a way to just fetch and index the pages necessary?
I suppose my tool could generate the fetch list(s) (I need to look into this
more closely).
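
As a rough illustration of the "tool generates the list" idea, here is a minimal Java sketch that turns a set of changed files into a plain text file with one URL per line. It assumes that every file under a document root is served at the same relative path under a base URL. The class and method names are hypothetical and not part of Nutch, and whether the generate/fetch/update cycle can be limited to exactly these URLs is still the open question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: turns changed files into a URL list.
class ChangedPagesSeedList {

    // Map one changed file under docRoot to the URL it is assumed to be served at.
    static String toUrl(Path docRoot, String baseUrl, Path changedFile) {
        Path root = docRoot.toAbsolutePath().normalize();
        Path file = changedFile.toAbsolutePath().normalize();
        if (!file.startsWith(root)) {
            throw new IllegalArgumentException(file + " is not under " + root);
        }
        String base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
        // Always use forward slashes in the URL; no URL-escaping is done here.
        return base + root.relativize(file).toString().replace('\\', '/');
    }

    // Map many changed files to URLs, dropping duplicates but keeping order.
    static List<String> toUrls(Path docRoot, String baseUrl, List<Path> changedFiles) {
        List<String> urls = new ArrayList<>();
        for (Path p : changedFiles) {
            String url = toUrl(docRoot, baseUrl, p);
            if (!urls.contains(url)) {
                urls.add(url);
            }
        }
        return urls;
    }

    // Write the URLs to a text file, one URL per line.
    static void writeSeedList(Path seedFile, List<String> urls) throws IOException {
        Files.write(seedFile, urls, StandardCharsets.UTF_8);
    }
}
```

The resulting file could then be handed to whatever step builds the fetch list. Deleted files would need separate handling, since their pages have to be removed from the index rather than refetched.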

- Are there any other libraries like JNotify to implement this functionality
that anyone knows about? I haven't found any others.

Does JNotify also implement protocols, e.g. HTTP, in order to notify across networks,
or does it only work locally?
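
For the purely local case, one alternative to JNotify is the JDK's own java.nio.file.WatchService (Java 7 and later). The sketch below is only an illustration under that assumption: it watches a single directory on the local machine, does not recurse into subdirectories, and has no network transport, so notifying a crawler on another host would still need something on top (for example a published list of changed URLs). The class name and helper methods are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: collects local file system changes in one directory.
class LocalChangeWatcher {

    // Register dir for create/modify/delete events and return the watcher.
    static WatchService watch(Path dir) throws IOException {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        return watcher;
    }

    // Wait up to timeoutSeconds for one batch of events and return the
    // affected paths; the list is empty if nothing happened in time.
    static List<Path> pollChanges(WatchService watcher, Path dir, long timeoutSeconds)
            throws InterruptedException {
        List<Path> changed = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (key == null) {
            return changed;
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                continue; // events were lost; a full rescan would be needed
            }
            // For ENTRY_* events the context is the path relative to dir.
            changed.add(dir.resolve((Path) event.context()));
        }
        key.reset(); // re-arm the key so later events are delivered
        return changed;
    }
}
```

A long-running tool would call pollChanges in a loop and pass each batch on to whatever builds the fetch list.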

Thanks

Michi

Any input/suggestions/additional questions/whatever on this subject is
appreciated as I would like to come up with a more optimal solution for us
intranet nutch users.

Ben


--
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61
