Well reasoned out. I'm in favour of this.

On 29/03/2012, at 10:35 PM, Daz DeBoer <[email protected]> wrote:
> After a little pondering, I'd favour an approach that is simple to describe and doesn't result in unexpected behaviour; I think an extra HEAD request here or there is ok.
>
> How about we perform a HEAD request if we have any cache candidates, be they local files or previous accesses to this URL?
> So the logic would be:
>
> Do we have any cache candidates?
> If not, just HTTP GET the resource and we're done.
> HTTP HEAD to get the resource meta-data (and possibly the SHA1).
> If we got a 404, the resource is missing; we're done.
> If we match a cached URL resource, just use it and we're done.
> If we have a local file candidate, HTTP GET the SHA1.
> If a published SHA1 was found and matches, then we can cache the URL resource and we're done.
> HTTP GET the actual resource.
>
> Pros:
> - We can get the SHA1 from the headers if available, and avoid the GET-SHA1 call.
> - If a local file matches, we can cache the URL resolution as if we did an HTTP GET, since we have the full HTTP headers + the content. We never have a cached resource without an origin.
> - After initially using a file from, say, .m2/repo to satisfy a request, from then on it will be just as if we had actually downloaded it from the URL. So there are no residual effects of using a local file in place of a downloaded one. Use of local files is a pure optimisation.
> - If the artifact is missing altogether, we get a single 404 for the HEAD, rather than a 404 for the SHA1 + a 404 for the GET.
> - It's simpler to understand, I think.
>
> Cons:
> - If we have a candidate local file but the SHA1 isn't published, we'll do an extra HEAD request, i.e. HEAD URL + GET SHA1 + GET URL, rather than just GET SHA1 + GET URL.
>
> Thoughts?
> Daz
>
> On 29 March 2012 11:00, Luke Daley <[email protected]> wrote:
>
> Hi all,
>
> As previously discussed, we are now leveraging last-modified and content-length values to avoid downloading changing artifacts (resources, really) that have not changed.
> Currently, our strategy is the following…
>
> Given an artifact id (group, name, version) and a repository:
>
> 1. See if we have resolved this artifact from this repository previously. If so, and if the cache entry has not expired, use the cached resource. Otherwise:
> 2. Convert the request into a URL to hit.
> 3. Search the local file system in a bunch of places (e.g. maven local, old gradle caches, the current filestore) for anything that was effectively resolved with the same artifact id.
> 4. Search the cache index for a record of the metadata for this URL.
>
> So we may now have 0..n “locally available resource candidates” that we think may be the same as what's behind the URL, and possibly a “cached external resource” (a record of the metadata from the last time we hit the resource, and its location in the filestore).
>
> The fetch process looks like this:
>
> * If there are any locally available resource candidates, fetch the remote sha1 for the resource, if it's available.
> * If any of the locally available resource candidates have the same checksum, use that instead of downloading the resource (at the cost of not obtaining metadata such as last modified, etag, etc.).
> * If not, or if there was no remote checksum available:
> * If we have a cached version of the resource, compare the cached metadata with the real metadata via a HEAD request (in practice, this implies that there was no remote checksum).
> ** If the metadata is unchanged (compare last modified date and content length for equality), use the cached version (including its metadata).
> ** If the metadata has changed, issue a GET to download the resource (then cache the resource, of course).
>
> I think this is the practical thing to do, but it's probably not theoretically correct.
>
> The issue is that by using the checksum check to determine whether something has changed, we lose any cached metadata about the resource.
> If we find something on the filesystem with the same checksum, all we can really assume is that that file has the same binary content. We cannot assume that it came from the same URL, which should invalidate any cached metadata we had for that URL. However, since the only metadata items that we care about are content length, last modified and etag, if the checksum hasn't changed we could probably assume that these values haven't changed either.
>
> Furthermore, it probably doesn't matter, because if there are remote checksums available for a resource then we aren't really going to use the metadata for anything.
>
> Further furthermore, our current strategy is optimised for the case where checksums are available, which is considered best practice. If we flipped it around and compared metadata first…
>
> Pros:
> * If the item is unchanged, we only have one HEAD request, as opposed to the GET on the checksum (faster).
> * We maintain cached metadata “integrity”.
>
> Cons:
> * If the item has changed, we have one HEAD for the metadata (to determine that it has changed), then another GET for the sha1 (to look for locally available resources).
>
> Keep in mind, the con there is the rare case. It means that the external resource has changed since the last time we saw it, but something else (i.e. maven, an older gradle version) has downloaded it in the meantime.
>
> Under this (metadata-first) strategy, the requests for a seen-before-but-changed resource would look like this:
>
> * HEAD to resource (get metadata) - determine changed
> * GET to checksum - most likely outcome is that we don't find a local version of this
> * GET to resource
>
> Under the current (checksum-first) strategy it looks like this:
>
> * GET to checksum - no local version found with checksum
> * GET to resource
>
> Under this (metadata-first) strategy, the requests for a seen-before-but-UNchanged resource would look like this:
>
> * HEAD to resource (get metadata) - determine unchanged
>
> Under the current (checksum-first) strategy it looks like this:
>
> * GET to checksum - local version found with checksum (can't guarantee it came from the same URL)
>
> Still following? :)
>
> For me this comes down to:
>
> * Is there a noticeable benefit of one HEAD request over one GET (for a sha1 text file)? If not, then we don't change. If so,
> * Do we optimise for the case where the resource is unchanged?
>
> There's another interesting option. Some servers send an “X-checksum-SHA1” header (e.g. Artifactory). In this case, we could use it when performing the initial HEAD and get the best of both worlds. Other servers advertise that their etags are SHA1s (e.g. Nexus). We could use this metadata, and keep the extra sha1 request as a fallback.
>
> --
> Luke Daley
> Principal Engineer, Gradleware
> http://gradleware.com
>
> ---------------------------------------------------------------------
> To unsubscribe from this list, please visit:
>
>     http://xircles.codehaus.org/manage_email
>
>
> --
> Darrell (Daz) DeBoer
> Principal Engineer, Gradleware
> http://www.gradleware.com
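The HEAD-first flow Daz proposes above can be sketched roughly as follows. This is a minimal Python sketch, not Gradle's actual code: `Meta`, `Candidate` and the `head`/`get`/`get_sha1` callables are hypothetical stand-ins for the real resource and HTTP-transport abstractions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Meta:
    last_modified: str
    content_length: int
    sha1: Optional[str] = None  # e.g. taken from a checksum header, if the server sends one

@dataclass
class Candidate:
    kind: str                   # "cached-url" (previous access to this URL) or "local-file"
    sha1: str
    meta: Optional[Meta] = None # "cached-url" candidates always carry their recorded metadata

def resolve(url: str,
            candidates: list,
            head: Callable[[str], Optional[Meta]],
            get: Callable[[str], bytes],
            get_sha1: Callable[[str], Optional[str]]):
    """HEAD-first resolution; head/get/get_sha1 stand in for the HTTP requests."""
    if not candidates:
        return "GET", get(url)              # no candidates: just GET, and we're done
    meta = head(url)                        # resource meta-data (and possibly the SHA1)
    if meta is None:
        return "MISSING", None              # a single 404, for the HEAD
    for c in candidates:                    # a matching cached URL resource wins outright
        if c.kind == "cached-url" and \
           (c.meta.last_modified, c.meta.content_length) == \
           (meta.last_modified, meta.content_length):
            return "CACHE", c
    sha1 = meta.sha1 or get_sha1(url)       # a header SHA1 avoids the GET-SHA1 call
    if sha1 is not None:
        for c in candidates:
            if c.kind == "local-file" and c.sha1 == sha1:
                return "LOCAL", c           # cache as if downloaded: HEAD headers + local content
    return "GET", get(url)                  # fall back to fetching the real thing
```

Note how the "pure optimisation" property falls out: a `LOCAL` hit is recorded with the full headers from the HEAD, so subsequent requests take the `CACHE` branch exactly as if we had downloaded the file.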
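For comparison, the current checksum-first fetch process Luke describes could look like this, under the same caveats (all names are hypothetical; the callables stand in for the HTTP requests, and `cached_meta` is whatever was recorded the last time we hit this URL):

```python
from dataclasses import dataclass

@dataclass
class Meta:
    last_modified: str
    content_length: int

def fetch(url, local_candidates, cached_meta, head, get, get_sha1):
    """Checksum-first fetch; local_candidates is a list of (sha1, path) pairs,
    cached_meta is None if we have never resolved this URL before."""
    if local_candidates:
        remote_sha1 = get_sha1(url)         # GET the published .sha1 file
        for sha1, path in local_candidates:
            if remote_sha1 is not None and sha1 == remote_sha1:
                return "LOCAL", path        # reused, but no metadata is obtained
    if cached_meta is not None:
        if head(url) == cached_meta:        # last-modified + content-length equal?
            return "CACHE", None            # unchanged: reuse, including metadata
    return "GET", get(url)                  # changed or unknown: download and cache
```

The `LOCAL` branch is exactly where the metadata "integrity" issue arises: we return a file without ever learning the URL's current headers.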
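The four request sequences Luke compares can be tallied mechanically. This little sketch just encodes the counts argued in the thread, for a seen-before resource with a published .sha1 (where "unchanged" implies a local copy with the matching checksum exists):

```python
def requests(strategy: str, changed: bool, local_match: bool) -> list:
    """Return the HTTP requests issued under each strategy."""
    if strategy == "checksum-first":
        reqs = ["GET sha1"]                 # look for a local version by checksum
        if not local_match:
            reqs.append("GET resource")
        return reqs
    assert strategy == "metadata-first"
    reqs = ["HEAD resource"]                # compare cached metadata first
    if changed:
        reqs.append("GET sha1")             # only now look for local copies
        if not local_match:
            reqs.append("GET resource")
    return reqs
```

So metadata-first wins by one request in the unchanged case (HEAD vs GET-sha1, and the cached metadata survives), and loses by one in the rarer changed case, as stated in the pros/cons above.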
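Finally, the "best of both worlds" idea: a sketch of pulling a SHA1 out of the HEAD response headers. The exact header name and the bare-hex-etag convention are assumptions about particular servers' behaviour (Artifactory-style checksum headers; some Nexus setups), which is why the separate .sha1 request stays as the fallback.

```python
import re
from typing import Optional

HEX_SHA1 = re.compile(r"[0-9a-fA-F]{40}")

def sha1_from_headers(headers: dict) -> Optional[str]:
    """Return an advertised SHA1 from response headers, or None if absent."""
    lower = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    checksum = lower.get("x-checksum-sha1", "").strip('"')
    if HEX_SHA1.fullmatch(checksum):
        return checksum.lower()
    etag = lower.get("etag", "")
    if etag.startswith("W/"):               # weak validator prefix
        etag = etag[2:]
    etag = etag.strip('"')
    if HEX_SHA1.fullmatch(etag):            # some servers use the SHA1 as the etag
        return etag.lower()
    return None                             # fall back to GET-ing the .sha1 file
```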
