Hello, I think this is a great idea. I am very interested in figuring out how to most efficiently index sites under these conditions. If anyone has tips, please do post them to the list... I would certainly support changes to make this easier!
btw, I also get an apache module compilation error under FreeBSD (will submit a bug report).

--mark B.

On Thu, 14 Mar 2002 05:38:54 +0100, Ole Tange wrote:
> I am indexing sites that have some pages that change every hour and
> some pages that change every year. Mostly, the pages that change often
> seem to be the same pages. I would imagine this is common for most of
> the world wide web, and we should use that knowledge to optimize the
> Period (time to next re-indexing).
>
> So what I am proposing is using exponential backoff to dynamically
> determine the "right" period for every page. It could be done as follows:
>
> MinPeriod 1h
> MaxPeriod 30d
>
> Time:
> 1: Index page.
>    Set reindex_at = date +
>        max(MinPeriod, min(MaxPeriod, date - last_changed))
> reindex_at: Re-index page.
>    If page changed: Set last_changed = date
>    Set reindex_at = date +
>        max(MinPeriod, min(MaxPeriod, date - last_changed))
>
> This will double the time between reindexings as long as the page is not
> changed (though at most MaxPeriod). If the page is changed, the time
> between reindexings will drop back to MinPeriod.
>
> If the page has not been changed for the last 2 minutes/hours/days/weeks,
> chances are that it will not be changed in the next 2
> minutes/hours/days/weeks either.
>
> A minor problem is when a URL that has been unmodified for a year starts
> to get modified a lot. This is where MaxPeriod kicks in, so the page
> _will_ be indexed once in a while.
>
> /Ole
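For anyone wanting to experiment, Ole's scheduling rule above could be sketched roughly like this (a minimal Python sketch of the proposal, not code from any patch; the `schedule` function and the MIN_PERIOD/MAX_PERIOD constant names are illustrative, with the values taken from the MinPeriod/MaxPeriod examples in the quoted mail):

```python
from datetime import datetime, timedelta

# Example values from the proposal: MinPeriod 1h, MaxPeriod 30d.
MIN_PERIOD = timedelta(hours=1)
MAX_PERIOD = timedelta(days=30)

def schedule(now, last_changed, changed):
    """Return (last_changed, reindex_at) after an index pass.

    `changed` says whether the fetched page differed from the stored
    copy.  The next period is the time since the last observed change,
    clamped to [MIN_PERIOD, MAX_PERIOD]: an unchanged page's interval
    roughly doubles each pass, while a changed page snaps back to
    MIN_PERIOD.
    """
    if changed:
        last_changed = now
    period = max(MIN_PERIOD, min(MAX_PERIOD, now - last_changed))
    return last_changed, now + period
```

So a page indexed 1h after its last change and found unchanged gets a 1h period (next visit 2h after the change, then 2h more, and so on, doubling up to 30d); as soon as a visit sees a change, the period collapses to 1h again.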
