Hello, I think this is a great idea.
I am very interested in figuring out how to most efficiently index sites
under these conditions.  If anyone has tips, please do post them to the
list...  I would certainly support changes to make this easier!

btw, I also get an apache module compilation error under FreeBSD (will
submit a bug report)

--mark B.

On Thu, 14 Mar 2002 05:38:54 +0100, Ole Tange wrote:
> I am indexing sites that have some pages that are changed every hour and
> some pages that are changed every year. Mostly the pages that change often
> seem to be the same pages. I would imagine that this is common for most
> of the world wide web, and we should use that knowledge to optimize the
> Period (time to next re-indexing).
> 
> So what I am proposing is using exponential backoff to dynamically
> determine the "right" period for every page. It could be done as follows:
> 
> MinPeriod 1h
> MaxPeriod 30d
> 
> Time:
> 1:           Index page.
>              Set reindex_at = date+
>                  max(MinPeriod,min(MaxPeriod, date-last_changed))
> reindex_at:  Re-index page.
>              If page changed: Set last_changed = date
>              Set reindex_at = date+
>                  max(MinPeriod,min(MaxPeriod, date-last_changed))
> 
> This will double the time between reindexings as long as the page is not
> changed (though capped at MaxPeriod). If the page is changed, then the
> time until the next reindexing will be MinPeriod.
> 
> If the page has not been changed for the last 2 minutes/hours/days/weeks
> chances are that it will not be changed in the next 2
> minutes/hours/days/weeks either.
> 
> A minor problem is when a URL that has been unmodified for a year starts
> to get modified a lot. This is where MaxPeriod will kick in, so the page
> _will_ be indexed once in a while.
> 
> 
> /Ole
> 
> 
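For anyone wanting to experiment with Ole's scheme before it lands in the
indexer, here is a minimal sketch in Python. The names (next_reindex,
on_reindex) and the datetime-based representation are my own, not part of
the proposal; the clamping rule is taken directly from the pseudocode above:

```python
from datetime import datetime, timedelta

MIN_PERIOD = timedelta(hours=1)   # MinPeriod 1h
MAX_PERIOD = timedelta(days=30)   # MaxPeriod 30d

def next_reindex(now, last_changed):
    """Next re-index time per the proposed rule:
    period = max(MinPeriod, min(MaxPeriod, now - last_changed))."""
    period = max(MIN_PERIOD, min(MAX_PERIOD, now - last_changed))
    return now + period

def on_reindex(now, last_changed, page_changed):
    """Run at reindex_at: update last_changed if the page changed,
    then schedule the next visit. Returns (last_changed, reindex_at)."""
    if page_changed:
        last_changed = now
    return last_changed, next_reindex(now, last_changed)
```

With this, an unchanged page's interval doubles on every visit (a page last
changed 6 hours ago is revisited in 6 hours; if still unchanged, the next
visit is 12 hours later, and so on, up to MaxPeriod), while a changed page
drops straight back to MinPeriod.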