Given the time of year, I'm afraid this message will fall on deaf
ears, but anyway ...
I was recently startled to discover that there's apparently no easy
way to perform a proper "conditional GET" [1] using Cocoon's sources.
I wonder: didn't anybody ever try to implement an RSS aggregator or
other kind of HTTP client that frequently requests seldom changing
Web resources? And if someone did, didn't they mind blindly
fetching the whole resource every time, even if not necessary?
Anyway, I just needed this and tried to see what could be done. And
of course, I wanted to exploit Cocoon's caching mechanism to store
the contents of already fetched resources. This turned out to be
harder than expected, due in part to the way checking the validity of
sources works, but mostly to my own ignorance of the subject.
First of all, I thought that the best place to implement this behavior
was in a Source object. This seems to me to be the correct choice,
but it has one potentially negative side-effect. More on this later.
I also decided to exploit the SourceValidity interface. After all,
it's there for this very purpose. Unfortunately, this is where things
turned out to be not so simple. To understand why, here is a
description of how my first attempt worked:
1. A generator requests an HTTP resource.
2. A suitable factory provides a new instance of class HttpSource.
3. The new Source's getInputStream method is called: this uses
Jakarta Commons HttpClient to fetch the requested URL.
4. The new Source's getValidity method is called: this returns a new
HttpSourceValidity object containing the values from the
Last-Modified and ETag response headers, if present.
5. The same HTTP resource is requested again.
6. The SourceValidity object associated with the previous request is
recovered and its isValid method is called.
7. The HttpSourceValidity implementation of the method uses the
stored Last-Modified and ETag values to perform a proper conditional
GET. Here, two things might happen:
8a. A "304 Not Modified" status is returned. isValid returns VALID
and Cocoon uses the cached version. Everybody is happy.
8b. A "200 OK" status is returned, as the original resource has
perhaps been modified. isValid returns INVALID and Cocoon calls the
Source's getInputStream method anew. Everybody is NOT happy, because
the original resource has been fetched twice: once by the
SourceValidity and once by the Source itself.
You see, the problem is that there's no easy way for the
SourceValidity to tell Cocoon that it should reuse what has just been
retrieved.
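
To make the exchange in steps 6 to 8 concrete, here is a minimal sketch of the conditional GET logic itself. It is not the code from my patch: the class and method names are made up for illustration, and it uses only java.net constants rather than HttpClient. The stored Last-Modified and ETag values go back to the server as If-Modified-Since and If-None-Match, and a 304 status means the cached copy is still good.

```java
import java.net.HttpURLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Cocoon or of the patch.
final class ConditionalGetSketch {

    /** Builds the conditional request headers from validators saved on a previous fetch. */
    static Map<String, String> conditionalHeaders(String lastModified, String etag) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (lastModified != null) {
            headers.put("If-Modified-Since", lastModified);
        }
        if (etag != null) {
            headers.put("If-None-Match", etag);
        }
        return headers;
    }

    /** True when the server answered 304 Not Modified, so the cached copy can be reused. */
    static boolean cachedCopyStillValid(int statusCode) {
        return statusCode == HttpURLConnection.HTTP_NOT_MODIFIED;
    }

    public static void main(String[] args) {
        // Example validator values are invented for the demo.
        Map<String, String> headers =
                conditionalHeaders("Sat, 24 Dec 2005 10:00:00 GMT", "\"abc123\"");
        System.out.println(headers);
        System.out.println(cachedCopyStillValid(304));
        System.out.println(cachedCopyStillValid(200));
    }
}
```

In case 8b the response to this very request already carries the new body, which is exactly the data that gets thrown away when the check lives in the SourceValidity.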
I could have used a HEAD request in the SourceValidity. This would
have saved some bandwidth, but the server, unless it is particularly
smart, would still have had to compute the response twice. And in any
case, doing two HTTP requests when one suffices does not seem optimal.
So I thought really hard about the problem and came up with a
(hopefully) brilliant solution: Use a ThreadLocal. The
HttpSourceValidity will store in a ThreadLocal the response data
(actually an instance of HttpClient's GetMethod class) and the
HttpSource will use it later, in the same request and hence in the
same thread, to provide an InputStream for reading.
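
The hand-off can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the patch: all names are hypothetical, and a plain byte[] stands in for the GetMethod instance that the real code keeps around. The validity side stashes what it fetched, and the source side takes it back once within the same thread, fetching afresh only if nothing was stashed.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.function.Supplier;

// Hypothetical names throughout; a byte[] stands in for HttpClient's GetMethod.
final class ThreadLocalHandoffSketch {

    // One slot per thread: what the validity check fetched during the current request.
    private static final ThreadLocal<byte[]> FETCHED = new ThreadLocal<>();

    /** Validity side: called when the server answers 200 with a new body. */
    static void stash(byte[] body) {
        FETCHED.set(body);
    }

    /** Source side: returns the stashed body once, then clears the slot. */
    static byte[] take() {
        byte[] body = FETCHED.get();
        FETCHED.remove(); // do not leak data into later requests on a pooled thread
        return body;
    }

    /** Source side: reuse the stashed response if present, otherwise fetch afresh. */
    static InputStream openStream(Supplier<byte[]> freshFetch) {
        byte[] body = take();
        if (body == null) {
            body = freshFetch.get();
        }
        return new ByteArrayInputStream(body);
    }

    public static void main(String[] args) throws Exception {
        stash("new content".getBytes("UTF-8"));
        // Reuses the stashed bytes; the supplier is not invoked here.
        InputStream in = openStream(() -> "fetched again".getBytes());
        System.out.println(in.available());
    }
}
```

Clearing the slot on read matters because servlet containers reuse threads, so a leftover value could otherwise surface in an unrelated request.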
I've provided a patch for this (see
http://issues.apache.org/jira/browse/COCOON-1726) against the 2.1
branch. Please have a look at it (particularly the FIXME comments), as
I would like some expert advice on the implementation before
finalizing it.
One problem that might arise is due to the fact that with the
cocoon.xconf settings included in the patch, all "http" URIs will be
served by this Source, overriding the default handling by Excalibur's
URLSource. This could change the behavior of existing applications,
but it would strike me as strange to have to use some other
pseudo-protocol (cached-http?).
Ugo
[1] http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers
--
Ugo Cei
Tech Blog: http://agylen.com/
Open Source Zone: http://oszone.org/
Wine & Food Blog: http://www.divinocibo.it/