How about an option to allow the server to index pages from the filesystem rather than through HTTP? For instance, there could be a config option that maps a URL to a location on disk, like so:
 
http://www.mydomain.com/ = /home/httpd/docs/index.html
 
This way the results can be stored under the proper URL even though all of the indexing took place without putting any strain on the web server.
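
To make the mapping idea concrete, here is a rough Python sketch of how a URL could be translated to a local file. It is only an illustration under assumptions: the names URL_MAP, INDEX_FILE and url_to_path are hypothetical and not part of any existing config, and it assumes that a URL ending in "/" should resolve to index.html, as in the example above.

```python
import posixpath
from urllib.parse import urlsplit, unquote

# Hypothetical config: URL prefix -> directory on disk.
URL_MAP = {"http://www.mydomain.com/": "/home/httpd/docs"}
INDEX_FILE = "index.html"  # assumed default document for directory URLs


def url_to_path(url):
    """Return the local file path for url, or None if it is not mapped."""
    parts = urlsplit(url)
    # Rebuild the URL without query string or fragment.
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    for prefix, root in URL_MAP.items():
        if base.startswith(prefix):
            rel = unquote(base[len(prefix):])
            if rel == "" or rel.endswith("/"):
                rel += INDEX_FILE
            path = posixpath.normpath(posixpath.join(root, rel))
            # Refuse paths that escape the mapped directory (e.g. via "..").
            if path.startswith(root + "/"):
                return path
            return None
    return None
```

The result is still stored under the original URL; only the place the bytes are read from changes.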
 
It would still function the same way, by spidering the site instead of getting a listing of files (which might include files that the maintainer does not want to be seen). To build on this even further, you could set it up so that this works only for certain file types (like htm, html, txt...), and if it comes across a file type that isn't listed in that config option it falls back to retrieving it over HTTP instead. This would be useful to avoid indexing the raw source of a dynamic page such as shtml or php, which might be including other files or running database queries.
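
The file-type fallback described above could look roughly like this in Python. Again, this is a sketch under assumptions: LOCAL_TYPES and fetch_method are hypothetical names, and a directory URL is assumed to resolve to index.html.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical config option: extensions that are safe to read from disk.
LOCAL_TYPES = {".htm", ".html", ".txt"}


def fetch_method(url, index_file="index.html"):
    """Return "file" if url may be read from disk, otherwise "http"."""
    path = urlsplit(url).path  # query string and fragment are ignored
    if path == "" or path.endswith("/"):
        path += index_file  # assumed default document
    ext = posixpath.splitext(path)[1].lower()
    return "file" if ext in LOCAL_TYPES else "http"
```

With this, page.html would be read from disk, while page.php or page.shtml would still go through the web server so that the generated output is what gets indexed.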
 
So the positive effect of this feature would be conserving CPU, since the web server does not get involved (and does not log unnecessary stats). It would also be flexible enough to still retrieve dynamic pages via the web as normal. This would also benefit my other goal of getting last-modified times for dynamic pages that otherwise don't include one in the HTTP header, since it could just check the file system's last-modified time instead. Not to mention that this could be a good way of indexing pages or data that don't even fall under a web server.
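
As an illustration of the last-modified point, the file system's timestamp can be read and formatted like an HTTP Last-Modified value using only the Python standard library. The function name last_modified is hypothetical, and this is a sketch rather than how any particular indexer does it.

```python
import os
from email.utils import formatdate


def last_modified(path):
    """Return the file's modification time as an HTTP-date string, or None."""
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        return None  # missing or unreadable file
    # usegmt=True produces the "GMT" suffix used in HTTP date headers.
    return formatdate(mtime, usegmt=True)
```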
 
--Brett
 
