|
How about an option to allow the server to index pages straight from the
filesystem rather than over HTTP? There could be a config option that maps a
URL to a directory, for instance:
http://www.mydomain.com/ =
/home/httpd/docs/index.html
This way the results can be stored under the proper URL even
though all of the indexing took place without putting any strain on the
web server.
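
To make the mapping idea concrete, here is a rough sketch in Python. Everything
in it is an assumption for illustration only: the URL_TO_DIR table, the
url_to_local_path name and the index.html default document are made up and are
not existing config options.

```python
import posixpath
from urllib.parse import unquote

# Hypothetical config: URL prefix -> local directory (illustration only).
URL_TO_DIR = {"http://www.mydomain.com/": "/home/httpd/docs/"}


def url_to_local_path(url, mapping=URL_TO_DIR, default_doc="index.html"):
    """Return the local file for a URL, or None if it has to go over HTTP."""
    for prefix, directory in mapping.items():
        if not url.startswith(prefix):
            continue
        rel = url[len(prefix):].split("#", 1)[0]  # drop any fragment
        if "?" in rel:
            return None  # a query string suggests a dynamic page
        rel = unquote(rel)  # decode %20 and friends
        if rel == "" or rel.endswith("/"):
            rel += default_doc  # directory URL -> assumed default document
        norm = posixpath.normpath(rel)
        # Refuse anything that would climb out of the mapped directory.
        if norm == ".." or norm.startswith("../") or norm.startswith("/"):
            return None
        return posixpath.join(directory, norm)
    return None  # no prefix matched
```

With the mapping above, url_to_local_path("http://www.mydomain.com/") gives
/home/httpd/docs/index.html, and a URL on any other host gives None so the
spider would fetch it over HTTP as usual.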
It would still work the same way, by spidering the site
rather than taking a listing of files (which might include files that the
maintainer does not want to be seen). To build on this even further, it could
be set up so that this applies only to certain file types (like htm, html, txt...),
and if it comes across a file type that isn't listed in that config option it
falls back to retrieving it over HTTP instead. This would be useful to
avoid indexing the raw source of a dynamic page such as shtml or php, which
might include other files or run database queries.
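
The file type check could be as simple as the following Python sketch. The
LOCAL_EXTENSIONS set and the fetch_method name are hypothetical, chosen only to
illustrate the idea, and the extension list is just the examples mentioned
above.

```python
import posixpath

# Hypothetical config value: extensions that are safe to read from disk.
LOCAL_EXTENSIONS = {".htm", ".html", ".txt"}


def fetch_method(path):
    """Return "local" for listed static file types, otherwise "http"."""
    ext = posixpath.splitext(path)[1].lower()  # e.g. ".html"; "" if none
    return "local" if ext in LOCAL_EXTENSIONS else "http"
```

So index.html and notes.txt would be read from disk, while page.shtml,
script.php, or a file with no extension at all would still be requested from
the web server.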
So the positive effect of this feature would be
conserving CPU, since the web server does not get involved (and does not log
unnecessary stats). It would also be flexible enough to still retrieve dynamic
pages via the web like normal. This would also help my other goal of getting
last-modified times for dynamic pages that otherwise don't include one in the
HTTP headers, since it could just check the file system's last-modified time instead.
Not to mention that this could be a good way of indexing pages or data that
aren't served by a web server at all.
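
For the last-modified part, a minimal Python sketch, assuming the local path
of the page is already known. The file_last_modified name is made up for
illustration; os.stat and email.utils.formatdate are standard library calls.

```python
import os
from email.utils import formatdate


def file_last_modified(path):
    """Return the file's modification time as an HTTP-style date string."""
    mtime = os.stat(path).st_mtime  # seconds since the epoch
    return formatdate(mtime, usegmt=True)  # e.g. "... 1970 00:00:00 GMT"
```

The returned string has the same format as a Last-Modified header, so it could
be stored in place of the value the web server did not send.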
--Brett
|
