[RT] More on caching, expires, and proxy-friendly headers

Gianugo Rabellino Tue, 11 Feb 2003 02:57:39 -0800

This RT builds on the one I posted more than a year ago, available at http://marc.theaimsgroup.com/?t=101074439900001&r=1&w=2.

As you know, we now have a basic HTTP header control that mimics, at the pipeline level, the mod_expires functionality of the Apache HTTPD server. This was a good start, but now I feel it's time to refine it and make it better. Work is needed on two fronts:

Proxy handling
==============

Full proxy compliance should be approached, once again :-), in microsteps. I've been reading the HTTP/1.1 specs and the proxy-related RFCs, and boy, it's not easy at all to implement a fully proxy-compliant system. It can be done, but it requires serious thinking and a major rework of the request handling phase.

Full proxy compliance depends on the ability to deal with conditional requests, handling a bunch of request headers that are all interdependent in some way and tricky to say the least. I'm not saying that we shouldn't do that sooner or later, but I'd rather plan this activity carefully, possibly together with someone (Chuck?) from the httpd group working on the proxy part, to ensure that things work smoothly.

So, the first microstep is an easy one, just as a start. The companion to the Expires header is the "Cache-Control" header (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9): this header allows finer-grained control over the response, telling proxies what to do with the results.

While Expires uses an HTTP date, built in Cocoon by adding the value of the pipeline@expires attribute to the current system time, Cache-Control is somewhat smarter, since it gives caches a hint on what is cacheable, how it should be cached (revalidated or not) and for how long, in seconds. To make it short, my proposal is to add a Cache-Control header to any response coming from a pipeline with the "expires" attribute set, using the following template:

Cache-Control: max-age={expires value in seconds}, public

The "public" keyword tells the proxy that it may store the resource in its cache even if it would not otherwise be considered cacheable (a response to an authenticated request, for example). This can be somewhat dangerous, since the proxy will then serve "protected" resources without any authentication being performed on the origin server, but in the end I think it's safe to assume that if a pipeline is marked with an "expires" attribute, then the user is perfectly aware that such a resource can, and will, be cached.
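To make the template concrete, here is a minimal sketch in plain Java of how the two header values relate to a single expires delta. The class and method names (ExpiresHeaders, cacheControl, expiresDate) are made up for illustration and are not part of Cocoon; in a servlet container the date formatting would normally be left to setDateHeader, so the formatter below is only there to keep the sketch self-contained.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not Cocoon code: derives both header values
// from one "expires" delta expressed in milliseconds.
final class ExpiresHeaders {

    // HTTP dates are always expressed in GMT, with a two-digit day of month.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                         .withZone(ZoneOffset.UTC);

    // Cache-Control carries a relative lifetime in seconds.
    static String cacheControl(long expiresMillis) {
        return "max-age=" + (expiresMillis / 1000) + ", public";
    }

    // Expires carries an absolute date: current time plus the delta.
    static String expiresDate(long nowMillis, long expiresMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(nowMillis + expiresMillis));
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        System.out.println("Cache-Control: " + cacheControl(fiveMinutes));
        System.out.println("Expires: " + expiresDate(System.currentTimeMillis(), fiveMinutes));
    }
}
```

The point of the sketch is that the pipeline only needs to keep the delta around: max-age uses it directly (in seconds), while Expires needs it added to the current time.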

The patch is a no-brainer, something like:

Index: src/java/org/apache/cocoon/components/pipeline/AbstractProcessingPipeline.java
===================================================================
RCS file: /home/cvs/xml-cocoon2/src/java/org/apache/cocoon/components/pipeline/AbstractProcessingPipeline.java,v
retrieving revision 1.33
diff -r1.33 AbstractProcessingPipeline.java
468a469
>
472c473,474
< res.setDateHeader("Expires", expires);
---
> res.setDateHeader("Expires", System.currentTimeMillis() + expires);
> res.setHeader("Cache-Control", "max-age=" + expires/1000 + ", public");
474c476
< new Long(expires));
---
> new Long(expires + System.currentTimeMillis()));
760c762
< return System.currentTimeMillis() + expires;
---
> return expires;

The only problem I see is that this header is not set under Tomcat (*argh*, Jetty works just fine!), so I have to investigate what's going wrong. Apart from that, I'm ready to commit it if you agree with the idea (I'm reluctant to commit it right away since it touches the pipeline core, where I have almost never worked). Now for the second (and more interesting) point: Cocoon integration.

Cocoon integration
==================

The above approach works perfectly for communication with the external world, be it a reverse proxy or just a browser cache. Sometimes, however, you might want to use this concept internally: imagine an aggregation of different Cocoon pipelines, where some resources need strict validity checks while others are pretty heavy to generate and uncacheable (because the components you are using are not cacheable themselves), yet you have full control over their expiration time. In this case, an internal use of the expires attribute would be pretty useful, e.g.:

<pipeline internal-only="true">
<parameter name="expires" value="now plus 5 minutes"/>
<match pattern="my-heavy-resource">
<generate src="xmldb:xindice:///db/not/changing/frequently"/>
<serialize/>
</match>
</pipeline>

<pipeline internal-only="true">
<match pattern="my-dynamic-resource">
<generate src="/content/that/might/change"/>
<serialize/>
</match>
</pipeline>

<pipeline>
<match pattern="mybeautifulportal.html">
<aggregate element="portal">
<part src="cocoon://my-heavy-resource" element="news"/>
<part src="cocoon://my-dynamic-resource" element="data"/>
</aggregate>
<transform src="myportal2html.xsl"/>
<serialize type="html"/>
</match>
</pipeline>

If we agree that this is useful, let's look at the actual implementation. First, let's get back to the general principle: if a user sets an "expires" attribute on a pipeline, what she wants to say is "I know better than the Cocoon cache for how long this resource has to be considered fresh". This is by all means a configuration imposed by the user, which the caching system should obey blindly. My opinion then, wrt the caching pipeline, is that if an expires was set, all the pipeline engine should do is check whether the given resource has already been generated and whether the expiration time has not yet passed. If so, the resource should be considered fresh, disregarding any Validity objects or Cacheable components.
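The decision described above can be sketched as a small standalone Java method. Everything here is hypothetical (the class name, the use of -1 as "not set", and the BooleanSupplier standing in for the usual Validity-based check); it is not the Cocoon API. The sketch also assumes that an entry whose expiration time has passed is simply treated as stale and regenerated, which the proposal above does not spell out.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration of "expires wins over validity checks".
final class ExpiresFirstCheck {

    /**
     * @param generatedAtMillis when the cached response was produced, or -1 if nothing is cached
     * @param expiresMillis     user-configured lifetime in ms, or -1 if no expires was set
     * @param nowMillis         current time in ms
     * @param validityCheck     stand-in for the regular (potentially heavy) validity check
     */
    static boolean canServeFromCache(long generatedAtMillis, long expiresMillis,
                                     long nowMillis, BooleanSupplier validityCheck) {
        if (generatedAtMillis < 0) {
            return false; // nothing has been generated yet
        }
        if (expiresMillis >= 0) {
            // The user-imposed lifetime is obeyed blindly:
            // the validity check is never consulted on this path.
            return nowMillis < generatedAtMillis + expiresMillis;
        }
        // No expires configured: fall back to the normal validity check.
        return validityCheck.getAsBoolean();
    }

    public static void main(String[] args) {
        BooleanSupplier heavyCheck = () -> {
            System.out.println("running validity check");
            return true;
        };
        // Cached 1 minute ago with a 5 minute lifetime: served without the heavy check.
        long now = System.currentTimeMillis();
        System.out.println(canServeFromCache(now - 60_000L, 300_000L, now, heavyCheck));
    }
}
```

The performance gain comes from the middle branch: when expires is set, the comparison of two timestamps replaces the whole validation machinery.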

This, AFAIU, would boost the performance even for internal pipelines and aggregation, and would let us use internal pipelines in a smarter and faster way. Not only that: if we are to use the expires feature even internally, Cocoon's performance will get a boost even without using a reverse proxy in front of the application server, since all the (potentially heavy) algorithms to check the resource's validity would be skipped.

Now for the implementation. I wish I knew the Cocoon caching internals better, but from a quick read it seems to me that there should be:

- some logic in CachedResponse to store and get expires (easy);

- appropriate logic in the proper points to obtain the expires object from the environment and set a CachedResponse accordingly (is it enough to change CachingProcessingPipeline#cacheResults?);

- more logic in the validatePipeline() method in AbstractCachingProcessingPipeline.java to take into account the expires object configured, if present.

- in all cases, all the algorithms that check if a cached entry is still valid, i.e. every place where a cache entry is built, validated or invalidated, should take into account the expires configuration.
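As a rough picture of the first two points in the list above, here is a self-contained Java sketch of a cache entry that stores an expiration time, together with a toy cache that fills it in when results are cached. The names ExpiresAwareCache and Entry are invented for the example, and the Entry class is only a stand-in: Cocoon's real CachedResponse and cache store are not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy cache, used only to show where the expires value would live.
final class ExpiresAwareCache {

    /** Stand-in for a cached response, extended with an absolute expiration time. */
    static final class Entry {
        final byte[] response;
        final long expiresAtMillis; // absolute time in ms; -1 means "no expires configured"

        Entry(byte[] response, long expiresAtMillis) {
            this.response = response;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> store = new HashMap<>();

    /** Stores a result, converting the configured delta into an absolute time. */
    void cacheResults(String key, byte[] response, long expiresDeltaMillis, long nowMillis) {
        long expiresAt = expiresDeltaMillis < 0 ? -1 : nowMillis + expiresDeltaMillis;
        store.put(key, new Entry(response, expiresAt));
    }

    /** Returns the entry for the key, or null if it is missing or has expired. */
    Entry lookup(String key, long nowMillis) {
        Entry entry = store.get(key);
        if (entry != null && entry.expiresAtMillis >= 0 && nowMillis >= entry.expiresAtMillis) {
            store.remove(key); // expired entries are dropped so they get regenerated
            return null;
        }
        return entry;
    }

    public static void main(String[] args) {
        ExpiresAwareCache cache = new ExpiresAwareCache();
        long now = System.currentTimeMillis();
        cache.cacheResults("my-heavy-resource", new byte[] {42}, 5 * 60 * 1000L, now);
        System.out.println(cache.lookup("my-heavy-resource", now + 1_000L) != null);
    }
}
```

Storing the absolute time in the entry, rather than the delta, keeps the lookup side trivial; whether that fits the existing CachedResponse design is exactly the open question.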

I have started to play with this too, but I am wondering if I'm following the right path or if I'm missing something. Also, it might be worth considering a separate CachingPipeline implementation (ExpiresEnabledCachingPipeline? Yuck ;-)), at least as a first step.

Comments and questions?

Ciao,

--
Gianugo Rabellino
Pro-netics s.r.l.
http://www.pro-netics.com
