Hi all,

Our organization uses Nutch on a documentation intranet that changes from time to time. To keep the index up to date, we recrawl the whole site every night. For an intranet this seems like a workaround at best. Our Nutch crawler runs on the same server as our content, so a simpler solution, IMO, would be to monitor file system events and recrawl only the affected pages each time something changes. That way our index would always be up to date and there would be no need for a brute-force recrawl every night.

I am willing to write this functionality and contribute it to the community, as I believe other organizations could benefit from it as well. Since I am not as familiar with Nutch as some of the folks here, I have a few questions.
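
To make the idea concrete, here is a minimal sketch of the change-detection step, written in Python for brevity rather than as Nutch code. It assumes a polling comparison of file modification times instead of real file system events (which something like JNotify would deliver), and the base URL `http://intranet.example/` and all function names are hypothetical. It only produces a plain list of URLs for changed files; how such a list is best fed back into Nutch is exactly what the questions are about.

```python
import os
import tempfile
from pathlib import Path
from urllib.parse import quote


def snapshot(root):
    """Map each file under root (as a relative POSIX path) to its mtime in ns."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): p.stat().st_mtime_ns
        for p in root.rglob("*")
        if p.is_file()
    }


def changed_paths(old, new):
    """Files that are new, or whose mtime differs from the previous snapshot."""
    return sorted(path for path, mtime in new.items() if old.get(path) != mtime)


def removed_paths(old, new):
    """Files present in the previous snapshot but gone now."""
    return sorted(set(old) - set(new))


def to_urls(paths, base_url):
    """Turn relative file paths into URLs under base_url (hypothetical mapping)."""
    base = base_url.rstrip("/")
    return [base + "/" + quote(p) for p in paths]


if __name__ == "__main__":
    # Demo on a throwaway directory standing in for the document root.
    with tempfile.TemporaryDirectory() as docroot:
        page = os.path.join(docroot, "index.html")
        with open(page, "w") as fh:
            fh.write("v1")
        before = snapshot(docroot)

        # Simulate an edit by forcing a different modification time.
        os.utime(page, ns=(10**18, 10**18))
        after = snapshot(docroot)

        for url in to_urls(changed_paths(before, after), "http://intranet.example/"):
            print(url)
```

Removed files are reported separately because they would need to be dropped from the index rather than refetched.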
- Is this a solution to a nonexistent problem? That is, is there already a nice solution using the tools provided? I know each page is timestamped in the database when it is fetched, but does this correspond to the last-modified date?
- Could this be done using the existing generate/fetch/update cycle with an index update? Is there a way to fetch and index only the necessary pages? I suppose my tool could generate the fetch list(s) (I need to look into this more closely).
- Does anyone know of other libraries like JNotify that could be used to implement this functionality? I haven't found any others.

Any input, suggestions, or additional questions on this subject are appreciated, as I would like to come up with a better solution for us intranet Nutch users.

Ben

--
View this message in context: http://www.nabble.com/File-system-watching-for-intranets-tf2260463.html#a6271430
Sent from the Nutch - Dev forum at Nabble.com.
