Ben Ogle wrote:

>Hi all, our organization is using nutch on a documentation intranet that
>changes every now and then. To keep the index up to date, we are recrawling
>the whole thing every night. For an intranet this seems to be a workaround
>at best. Our nutch crawler is on the same server as our content and a
>simpler solution, IMO, would be to monitor file system events and just
>recrawl the necessary pages each time something changes. That way our index
>would always be up to date and there would be no reason to do a brute force
>recrawl every night. I am willing to write this functionality and contribute
>it to the community as I believe other organizations could benefit from this
>as well, but since I am not as familiar with nutch as some of the folks
>here, I have a few questions.
>
>- Is this a solution to a nonexistent problem?
>

I don't think there is a standardized way to do this yet, so every step
in this direction would be a great improvement.

> I mean, is there a nice
>solution using the tools already provided?
>

Not that I am aware of, but I guess other people have tackled this as well.

I think it would be nice to generate an RSS feed or something similar as the
fetchlist, which could then also be read by other crawlers.
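
A minimal sketch of the feed idea, assuming the changed pages are already
known as (URL, modification time) pairs. The function names and the
intranet.example URLs are made up for illustration; nothing here is an
existing Nutch interface.

```python
import xml.etree.ElementTree as ET
from email.utils import formatdate


def build_feed(changed, title="Changed intranet pages",
               link="http://intranet.example/"):
    """Return an RSS 2.0 document (as a string) listing changed pages.

    `changed` is an iterable of (url, modified_epoch_seconds) pairs.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Pages to refetch"
    for url, mtime in changed:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "guid").text = url
        # RFC 822 style date, as RSS 2.0 expects for pubDate
        ET.SubElement(item, "pubDate").text = formatdate(mtime, usegmt=True)
    return ET.tostring(rss, encoding="unicode")


def urls_from_feed(xml_text):
    """What a consuming crawler might do: pull the URLs back out."""
    root = ET.fromstring(xml_text)
    return [item.findtext("link") for item in root.iter("item")]
```

Any crawler that can parse RSS could then use the item links as its list
of pages to refetch.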

> I know each page is time stamped
>in the database when it is fetched, but does this correspond to the last
>modified date? 
>  
>

I am still not sure whether Nutch actually compares last-modified dates. I
know there is an option called "adddays", but that is more for postponing
re-crawling, e.g. for 30 days.
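
Since the crawler and the content are on the same server, the comparison
could in principle be done outside Nutch. A rough sketch, assuming the time
of the last fetch is available per file; the `fetch_times` dict is a
hypothetical stand-in and does not reflect how Nutch stores this internally.

```python
import os


def needs_refetch(path, last_fetch_time):
    """True if the file changed after the last fetch, or was never fetched."""
    if last_fetch_time is None:
        return True
    return os.path.getmtime(path) > last_fetch_time


def stale_pages(fetch_times, paths):
    """Return the paths whose content is newer than the recorded fetch.

    `fetch_times` maps a path to the epoch seconds of its last fetch
    (hypothetical; it would have to be exported from the crawl database).
    """
    return [p for p in paths if needs_refetch(p, fetch_times.get(p))]
```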

>- Could this be done by using the existing generate/fetch/update cycle with
>an index update? Is there a way to just fetch and index the pages necessary?
>I suppose my tool could generate the fetch list(s) (I need to look into this
>more closely).
>
>- Are there any other libraries like JNotify to implement this functionality
>that anyone knows about? I haven't found any others.
>  
>

Does JNotify also implement network protocols, e.g. HTTP, so that it can
notify across networks, or does it only work locally?
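
For the purely local case, changes can also be detected without any
notification library by comparing two directory snapshots. This is only a
polling sketch, not a replacement for real file system events; the function
names and the base URL are assumptions, and deleted files are ignored.

```python
import os


def snapshot(root):
    """Map each file under `root` (as a relative path) to its mtime."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            state[os.path.relpath(full, root)] = os.path.getmtime(full)
    return state


def changed_urls(before, after, base_url):
    """URLs of files that are new or modified between two snapshots."""
    changed = sorted(p for p, mtime in after.items()
                     if before.get(p) != mtime)
    prefix = base_url.rstrip("/") + "/"
    return [prefix + p.replace(os.sep, "/") for p in changed]
```

The resulting list of URLs is the kind of input a fetch list could be built
from, written out one URL per line.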

Thanks

Michi

>Any input/suggestions/additional questions/whatever on this subject is
>appreciated as I would like to come up with a more optimal solution for us
>intranet nutch users.
>
>Ben
>  
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers