Ben Ogle wrote:

>Hi all, our organization is using nutch on a documentation intranet that
>changes every now and then. To keep the index up to date, we are recrawling
>the whole thing every night. For an intranet this seems to be a workaround
>at best. Our nutch crawler is on the same server as our content and a
>simpler solution, IMO, would be to monitor file system events and just
>recrawl the necessary pages each time something changes. That way our index
>would always be up to date and there would be no reason to do a brute force
>recrawl every night. I am willing to write this functionality and contribute
>it to the community as I believe other organizations could benefit from this
>as well, but since I am not as familiar with nutch as some of the folks
>here, I have a few questions.
>
>- Is this a solution to a nonexistent problem?
>
I don't think there is any standardized way to do this yet, so every
step in this direction would be a great improvement.

>I mean, is there a nice
>solution using the tools already provided?
>

Not that I am aware of, but I guess other people have tackled this as
well. I think it would be nice to generate an RSS feed or something
similar as a fetch list, which could also be accessed by other crawlers.

>I know each page is time stamped
>in the database when it is fetched, but does this correspond to the last
>modified date?
>

I am still not sure whether Nutch actually compares the last-modified
dates. I know there is something called "adddays", but this is more to
postpone re-crawling, e.g. for 30 days.

>- Could this be done by using the existing generate/fetch/update cycle with
>an index update? Is there a way to just fetch and index the pages necessary?
>I suppose my tool could generate the fetch list(s) (I need to look into this
>more closely).
>
>- Are there any other libraries like JNotify to implement this functionality
>that anyone knows about? I haven't found any others.
>

Does JNotify also implement protocols, e.g. HTTP, in order to notify
across networks, or does it only work locally?

Thanks

Michi

>Any input/suggestions/additional questions/whatever on this subject is
>appreciated as I would like to come up with a more optimal solution for us
>intranet nutch users.
>
>Ben
>

-- 
Michael Wechner
Wyona - Open Source Content Management - Apache Lenya
http://www.wyona.com http://lenya.apache.org
[EMAIL PROTECTED] [EMAIL PROTECTED]
+41 44 272 91 61
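
For illustration only, the "monitor file system events and write a fetch
list" idea from this thread might be sketched roughly as follows. This is
not Nutch code and not JNotify: it assumes a Java 7+ runtime and the
standard java.nio.file.WatchService, and the class name ChangedPageWatcher,
the toUrl and watch helpers, the example paths and the base URL are all
hypothetical. It also assumes every file under the document root is served
at the same relative path below one base URL. How the resulting URL list is
then fed into a generate/fetch/update run is left open here.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: watches one directory and appends the URLs of
// created or modified files to a plain-text URL list.
public class ChangedPageWatcher {

    // Maps a file below docRoot to the URL it is assumed to be served at.
    // Assumption: the web server mirrors the directory layout under baseUrl.
    // Path segments are not URL-encoded in this sketch.
    public static String toUrl(Path docRoot, String baseUrl, Path changedFile) {
        Path relative = docRoot.toAbsolutePath().normalize()
                .relativize(changedFile.toAbsolutePath().normalize());
        StringBuilder url = new StringBuilder(baseUrl);
        if (!baseUrl.endsWith("/")) {
            url.append('/');
        }
        boolean first = true;
        for (Path segment : relative) {
            if (!first) {
                url.append('/');
            }
            url.append(segment.toString());
            first = false;
        }
        return url.toString();
    }

    // Blocks and appends one URL per changed file to urlListFile.
    // Note: a WatchService registration covers a single directory only;
    // subdirectories would each need their own registration.
    public static void watch(Path docRoot, String baseUrl, Path urlListFile)
            throws IOException, InterruptedException {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            docRoot.register(watcher,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY);
            while (true) {
                WatchKey key = watcher.take(); // waits for the next batch of events
                Set<String> urls = new LinkedHashSet<>(); // de-duplicates within a batch
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue; // events were lost; a full recrawl may be needed
                    }
                    Path changed = docRoot.resolve((Path) event.context());
                    urls.add(toUrl(docRoot, baseUrl, changed));
                }
                if (!urls.isEmpty()) {
                    Files.write(urlListFile, urls,
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                }
                if (!key.reset()) {
                    break; // the watched directory is no longer accessible
                }
            }
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 3) {
            System.err.println("usage: ChangedPageWatcher <docRoot> <baseUrl> <urlListFile>");
            return;
        }
        watch(Paths.get(args[0]), args[1], Paths.get(args[2]));
    }
}
```

Started with, say, a document root, a base URL and an output file as its
three arguments, the sketch only works locally on the machine holding the
files, so it does not answer the question about notification across
networks raised above.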
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
