[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465517 ]
Andrzej Bialecki commented on NUTCH-61:
----------------------------------------

Actually, there is a way to do this, and this patch implements it.

We define a maximum "time to live" for _any_ page, no matter when it was last fetched or what its re-fetch interval is. This is a system-wide setting. If the re-fetch interval is longer than this value, or the page somehow hasn't been re-fetched for at least that long for other reasons (e.g. because it was unmodified, and we don't fetch unmodified content), such pages will be forcibly included in fetchlist candidates as if they had DB_UNFETCHED status.

This means we can be sure that any pages still present in segments older than this maximum TTL will have been refetched, and we can safely discard all segments older than TTL.

> Adaptive re-fetch interval. Detecting umodified content
> -------------------------------------------------------
>
>                 Key: NUTCH-61
>                 URL: https://issues.apache.org/jira/browse/NUTCH-61
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>         Attachments: 20050606.diff, 20051230.txt, 20060227.txt, nutch-61-417287.patch
>
>
> Currently Nutch doesn't adjust automatically its re-fetch period, no matter
> if individual pages change seldom or frequently. The goal of these changes is
> to extend the current codebase to support various possible adjustments to
> re-fetch times and intervals, and specifically a re-fetch schedule which
> tries to adapt the period between consecutive fetches to the period of
> content changes.
> Also, these patches implement checking if the content has changed since last
> fetching; protocol plugins are also changed to make use of this information,
> so that if content is unmodified it doesn't have to be fetched and processed.
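For illustration only, the maximum-TTL rule described in the comment could be sketched roughly as below. This is not the code from the attached patch: the class, field and method names (MaxTtlSketch, Page, lastChecked, lastContentFetch, isFetchCandidate, isSegmentDiscardable) and the 90-day TTL are all assumptions made up for the example. Only the DB_UNFETCHED status name comes from the comment.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the max-TTL rule; not the actual Nutch patch.
class MaxTtlSketch {
    enum Status { DB_UNFETCHED, DB_FETCHED }

    // Assumed per-page record. lastChecked is the last time the page was
    // visited (even if unmodified); lastContentFetch is the last time its
    // content was actually downloaded and stored in a segment.
    static final class Page {
        final Status status;
        final Instant lastChecked;
        final Instant lastContentFetch;
        final Duration fetchInterval;

        Page(Status status, Instant lastChecked, Instant lastContentFetch,
             Duration fetchInterval) {
            this.status = status;
            this.lastChecked = lastChecked;
            this.lastContentFetch = lastContentFetch;
            this.fetchInterval = fetchInterval;
        }
    }

    // A page is a fetchlist candidate if it is unfetched, if its content is
    // at least maxTtl old (forced, as if DB_UNFETCHED), or if its regular
    // re-fetch interval has elapsed.
    static boolean isFetchCandidate(Page page, Instant now, Duration maxTtl) {
        if (page.status == Status.DB_UNFETCHED) {
            return true;
        }
        Duration sinceContent = Duration.between(page.lastContentFetch, now);
        if (sinceContent.compareTo(maxTtl) >= 0) {
            return true; // forced re-fetch, whatever the interval says
        }
        Duration sinceCheck = Duration.between(page.lastChecked, now);
        return sinceCheck.compareTo(page.fetchInterval) >= 0;
    }

    // Per the comment, segments older than the maximum TTL can be discarded.
    static boolean isSegmentDiscardable(Instant segmentCreated, Instant now,
                                        Duration maxTtl) {
        return Duration.between(segmentCreated, now).compareTo(maxTtl) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2007-01-17T00:00:00Z");
        Duration maxTtl = Duration.ofDays(90); // assumed system-wide setting

        // Checked 10 days ago and found unmodified, but the stored content
        // is 120 days old and the re-fetch interval is a full year.
        Page stale = new Page(Status.DB_FETCHED,
                now.minus(Duration.ofDays(10)),
                now.minus(Duration.ofDays(120)),
                Duration.ofDays(365));

        System.out.println("stale page is candidate: "
                + isFetchCandidate(stale, now, maxTtl));
        System.out.println("100-day-old segment discardable: "
                + isSegmentDiscardable(now.minus(Duration.ofDays(100)), now, maxTtl));
    }
}
```

The sketch keeps two timestamps per page so that an "unmodified" check does not reset the age of the stored content. Whether the patch tracks this the same way is not stated in the comment.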
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers