29.06.2015, 16:04, A. Soroka wrote:

> 1) ETag-per-Dataset: this is a single ETag value for any Dataset for
> all requests, updated whenever a mutating request completes. This
> would work by letting any change on a Dataset whatsoever that comes
> through Fuseki invalidate all ETag-based caching on that Dataset.
> This seems to be where Andy Seaborne and Rob Vesse were heading, but
> I obviously can't speak for them. Advantage: relatively simple.
> Disadvantages: changes in the indexes not performed by Fuseki will
> not be reflected properly, only useful for instances that receive
> the right patterns of changes (meaning for which mutations aren't
> too "evenly sprinkled" amongst queries, thus keeping the cache often
> invalidated).

+1 for either this, or a Last-Modified header that works the same way
(a per-dataset timestamp that is updated on any change). An ETag is
more opaque than a Last-Modified timestamp; a timestamp might allow a
cache to make more intelligent choices, as it can see whether the data
is only seconds old vs. hours or days.
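To illustrate the difference (with made-up numbers, not anything Fuseki
does today): a cache that sees only a Last-Modified timestamp can apply
the usual heuristic-freshness rule from HTTP caching, treating the
response as fresh for some fraction of the time since the last change.
An opaque ETag carries no such signal.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Rough sketch of a cache-side heuristic (not Fuseki code): with no
# explicit expiry, RFC 7234 suggests caching for a fraction (commonly
# 10%) of the time since the resource last changed.
def heuristic_fresh_for(last_modified_header, now=None, fraction=0.1):
    """Return a timedelta for which the cache may treat data as fresh."""
    now = now or datetime.now(timezone.utc)
    last_modified = parsedate_to_datetime(last_modified_header)
    return (now - last_modified) * fraction

# Data last modified 10 days ago -> cache for ~1 day; data modified
# 20 seconds ago -> cache for only ~2 seconds.
now = datetime(2015, 6, 29, 16, 0, tzinfo=timezone.utc)
stale_header = format_datetime(now - timedelta(days=10))
print(heuristic_fresh_for(stale_header, now=now))
```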

We currently use a Varnish cache in front of Fuseki (see [1] for
details), which is configured for a long expiry time. We manually
invalidate the Varnish cache after any updates to Fuseki data. This
means that frequently occurring queries will be answered by Varnish
directly without going to Fuseki at all. This works and performs well,
but the downside is having to do the manual invalidation, which then
throws away the whole cache. We usually only update the data once per
day so this is OK for now.
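For the record, the invalidation step itself is small; a sketch of how a
load script might trigger it over HTTP (host, port, and path are made up
here, and this assumes Varnish's VCL has the usual vcl_recv rule that
turns a PURGE request into a cache purge — our setup described in [1]
throws away the whole cache rather than single URLs, but the idea is the
same):

```python
import http.client

# Hypothetical invalidation helper: after loading new data into Fuseki,
# ask the Varnish instance in front of it to drop its cached copy.
def purge(host, port, path):
    conn = http.client.HTTPConnection(host, port)
    conn.request("PURGE", path)       # non-standard method, handled in VCL
    status = conn.getresponse().status
    conn.close()
    return status                     # 200 if Varnish accepted the purge
```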

But having Fuseki respond with ETag/Last-Modified would enable another
mode of operation, which might be suitable for more dynamic data. In
this model, Varnish (or another HTTP cache such as nginx) would keep the
data for a shorter period (a few minutes), during which it would serve
cached responses without consulting Fuseki. After this period, it would
still keep the data, but when a new query comes in, it would ask Fuseki
whether its cached data is still valid based on ETag or Last-Modified.
If it's still valid, it could keep serving it for a few more minutes,
and so on. This would still be much better than asking Fuseki every
time, and also better than throwing away the data completely after a
few minutes, which are currently the main options besides using a long
expiry time and manual invalidation.
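A minimal sketch of this revalidation mode, with a stand-in origin
callable playing the role of Fuseki (the real thing would answer a
conditional GET carrying If-None-Match with 304 Not Modified while the
dataset's ETag still matches):

```python
# Sketch only: a cache entry is served freely for `ttl` seconds, then
# revalidated against the origin using its stored ETag instead of being
# discarded outright.
class RevalidatingCache:
    def __init__(self, origin, ttl=300):
        self.origin = origin      # callable(etag) -> (status, etag, body)
        self.ttl = ttl            # seconds to serve without asking origin
        self.entry = None         # (etag, body, fresh_until)

    def get(self, now):
        if self.entry and now < self.entry[2]:
            return self.entry[1]              # fresh: no request at all
        old_etag = self.entry[0] if self.entry else None
        status, etag, body = self.origin(old_etag)
        if status == 304:                     # unchanged: keep old body
            etag, body = old_etag, self.entry[1]
        self.entry = (etag, body, now + self.ttl)
        return body
```

On a 304 the cache keeps its stored body and merely extends the
freshness window, so a full result set only crosses the wire again when
the dataset has actually changed.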

So I'd definitely consider using this if Fuseki gets the support.

> 2) Constant Expires: Rob Vesse discusses this a bit in the issue.
> It's an Expires header that is configurable to allow some admin
> adjustment, but is constant during runtime. Advantage: dead simple.
> Disadvantage: unless the usage scenario is very tightly controlled,
> there's going to be some leakage of stale data. That may or may not
> be a big problem for an integrator, depending on use case. It would
> have to be carefully documented, I think, to avoid nasty surprises.

This is compatible with 1). I'd give it a +1 too, although it's not as
important as 1). We currently set the long expiry time in Varnish
configuration, but it would be more elegant to be able to do this in
Fuseki as a per-dataset, constant-during-runtime option. Apache
mod_expires [2] does something similar, and I've used it to set expiry
times for static content. It works very well, so I'd recommend looking
at the mod_expires documentation for inspiration.
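The whole of what a mod_expires-style "access plus 24 hours" rule does
can be sketched in a few lines; the dataset path and lifetime below are
made up for illustration, and a per-dataset Fuseki option would
presumably boil down to something similar:

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Hypothetical per-dataset, constant-during-runtime expiry config.
EXPIRY = {"/skosmos/data": timedelta(hours=24)}

def expiry_headers(dataset, now=None):
    """Turn a fixed max age into Expires / Cache-Control headers."""
    now = now or datetime.now(timezone.utc)
    max_age = EXPIRY.get(dataset, timedelta(0))
    return {
        "Expires": format_datetime(now + max_age, usegmt=True),
        "Cache-Control": "max-age=%d" % max_age.total_seconds(),
    }
```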

-Osma


[1] https://github.com/NatLibFi/Skosmos/wiki/FusekiTuning#http-caching

[2] http://httpd.apache.org/docs/2.2/mod/mod_expires.html

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
[email protected]
http://www.nationallibrary.fi
