Thanks for the code! It is indeed very simple! That's why I like Cocoon :) Regarding the Last-Modified header, getLastModified() does work for GET requests, but a GET request also brings the whole document and not just the headers. That's why I was observing the whole document being transferred all the time.
Ah, of course. Now it's obvious :) getLastModified() is only for Cocoon's pipeline caching, as it is assumed that the pipeline processing is the most time-consuming part. Of course this changes quickly if you fetch the content from a remote source.
So what is the best scenario for the HTMLGenerator? Always do a HEAD request to see if the remote document is modified and, if it is, make a subsequent GET request, OR always make a GET on every request? It depends on the size of the document and the modification frequency. If the remote document is too large, it is inefficient to make a GET all the time, as the HTMLGenerator does today. On the other hand, if the document is modified frequently, it would be inefficient to make both a HEAD and a GET request, since it means making two connections to the remote site. Using a sitemap parameter specifying the interval at which the HTMLGenerator would fetch data would address both issues. Do you think it is worthwhile to change the current HTMLGenerator to include this extra parameter?
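To make the interval idea concrete, here is a minimal Java sketch of it, under stated assumptions. It is not Cocoon code: the class name IntervalFetcher, the remoteGet supplier (standing in for the actual GET) and the injected clock are all hypothetical names chosen for illustration. The point is only that the remote site is contacted at most once per interval, with neither a HEAD nor a GET in between.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical helper, not part of Cocoon: serves a cached copy of a remote
 * document and refetches it at most once per configured interval.
 */
final class IntervalFetcher {
    private final Supplier<String> remoteGet; // stands in for the real GET request
    private final long intervalMillis;        // would come from the sitemap parameter
    private final LongSupplier clock;         // injectable so the logic can be tested
    private String cached;
    private long lastFetch;
    private boolean fetched = false;

    IntervalFetcher(Supplier<String> remoteGet, long intervalMillis, LongSupplier clock) {
        this.remoteGet = remoteGet;
        this.intervalMillis = intervalMillis;
        this.clock = clock;
    }

    /** Returns the cached document, refetching only when the interval has elapsed. */
    synchronized String get() {
        long now = clock.getAsLong();
        if (!fetched || now - lastFetch >= intervalMillis) {
            cached = remoteGet.get(); // single connection to the remote site
            lastFetch = now;
            fetched = true;
        }
        return cached;
    }
}
```

In real use the clock would presumably be System::currentTimeMillis. The trade-off is that a change on the remote site may go unnoticed for up to one interval.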
Definitely not, as this problem is not HTMLGenerator-specific but URLSource-specific. So I will also raise this question on the dev list; maybe someone has a clever proposal for this.
For the devs with clever ideas here's the thread (unfortunately RES breaks the thread view at marc.theaimsgroup.com, so switching to gmane.org):
http://thread.gmane.org/gmane.text.xml.cocoon.user/34445
Joerg
