On 30/03/2012, at 8:35 AM, Daz DeBoer wrote:

> After a little pondering, I'd favour an approach that is simple to describe 
> and doesn't result in unexpected behaviour; I think an extra HEAD request 
> here or there is ok.
> 
> How about we perform a HEAD request if we have any cache candidates, be they 
> local files or previous accesses to this URL.
> So the logic would be:
> Do we have any cache candidates? 
> If not, just HTTP GET the resource and we're done.
> HTTP HEAD to get the resource meta-data (and possibly the SHA1)
> If we got a 404, the resource is missing, we're done.
> If we match a cached URL resource, just use it and we're done.

Just to clarify:

* If the sha1 of the cached resource == the sha1 from the response, use the 
cached resource. Skip if the response has no sha1.
* If the etag of the cached resource == the etag from the response, use the 
cached resource. Skip if the etag is null for either.
* If the (content-length, last-modified-date) of the cached resource == the 
(content-length, last-modified-date) from the response, use the cached 
resource. Skip if content-length or last-modified-date is null for either.

We also only need to do the HTTP GET SHA1 if we don't already have the sha1 
from the HEAD request.
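
In rough Java, just to pin those rules down (CacheMatch and all the parameter names here are invented for illustration):

    import java.util.Date;

    class CacheMatch {
        // Sketch of the match rules above; null / -1 means "not available".
        static boolean upToDate(
                String cachedSha1, String remoteSha1,
                String cachedEtag, String remoteEtag,
                long cachedLength, long remoteLength,
                Date cachedLastModified, Date remoteLastModified) {
            // Rule 1: a SHA1 comparison is definitive when both sides have one.
            if (cachedSha1 != null && remoteSha1 != null) {
                return cachedSha1.equals(remoteSha1);
            }
            // Rule 2: etags, skipped if either side has none.
            if (cachedEtag != null && remoteEtag != null) {
                return cachedEtag.equals(remoteEtag);
            }
            // Rule 3: (content-length, last-modified), skipped if either is missing.
            if (cachedLength >= 0 && remoteLength >= 0
                    && cachedLastModified != null && remoteLastModified != null) {
                return cachedLength == remoteLength
                        && cachedLastModified.equals(remoteLastModified);
            }
            // Nothing comparable: treat as changed and fall through to the GET.
            return false;
        }
    }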


> If a published SHA1 was found and matches, then we can cache the URL resource 
> and we're done. 
> HTTP GET the actual resource
> Pros:
> - We can get the SHA1 from the headers if available, and avoid the GET-SHA1 
> call.
> - If a local file matches, we can cache the URL resolution as if we did an 
> HTTP GET, since we have the full HTTP headers + the content. We never have a 
> cached resource without an origin.
> - After initially using a file from say .m2/repo to satisfy a request, from 
> then on it will be just like we actually downloaded it from the URL. So there 
> are no residual effects of using a local file in place of a downloaded one. 
> Use of local files is a pure optimisation.
> - If the artifact is missing altogether, we get a single 404 for the HEAD, 
> rather than a 404 for the SHA1 + a 404 for the GET.
> - It's simpler to understand, I think.
> 
> Cons:
> - If we have a candidate local file but the SHA1 isn't published, we'll do an 
> extra HEAD request, i.e. HEAD URL + GET SHA1 + GET URL, rather than just GET 
> SHA1 + GET URL.
> 
> Thoughts?

Looks good. A couple of comments:

* It bothers me a little that an unchanged (timestamp + length) can 
short-circuit the SHA1 check, as the SHA1 is a better indicator of whether 
something has changed. But I think I can live with this.
* We need to deal with the case where a server does not handle the HEAD request 
(e.g. stuff hosted at googlecode).
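
To make that concrete, the flow I have in mind looks roughly like this, including the no-HEAD fallback. A sketch only, using HttpURLConnection purely for illustration:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    class MetaDataFetch {
        // Issue a HEAD for the resource meta-data, falling back to a GET when
        // the server rejects HEAD (405 / 501). Error handling elided.
        static HttpURLConnection fetchMetaData(URL url) throws IOException {
            HttpURLConnection head = (HttpURLConnection) url.openConnection();
            head.setRequestMethod("HEAD");
            int status = head.getResponseCode();
            if (status == HttpURLConnection.HTTP_BAD_METHOD
                    || status == HttpURLConnection.HTTP_NOT_IMPLEMENTED) {
                // No HEAD support: do a GET and read the same headers.
                // We pay for the body, but resolution still works.
                HttpURLConnection get = (HttpURLConnection) url.openConnection();
                get.setRequestMethod("GET");
                get.getResponseCode();
                return get;
            }
            return head;
        }
    }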


> Daz
> 
> On 29 March 2012 11:00, Luke Daley <[email protected]> wrote:
> Hi all,
> 
> As previously discussed, we are now leveraging last modified and content 
> length values to avoid downloading changing artifacts (resources really) that 
> have not changed. Currently, our strategy is the following…
> 
> Given an artifact id (group, name, version) and a repository:
> 
> 1. See if we have resolved this artifact from this repository previously. If 
> so, and if the cache entry has not expired, use the cached resource. Otherwise:
> 2. Search the local file system in a bunch of places (e.g. maven local, old 
> gradle caches, the current filestore) for anything that was resolved with 
> effectively the same artifact id
> 3. Convert the request into a url to hit
> 4. Search the cache index for a record of the metadata for this url
> 
> So we now may have 0..n “locally available resource candidates” that we think 
> may be the same as what's behind the URL, and possibly a “cached external 
> resource” (a record of the metadata from the last time we hit the resource, 
> and its location in the filestore).
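
So the state going into the fetch is roughly this (all names invented, just to make sure I'm reading it right):

    import java.io.File;
    import java.util.Date;
    import java.util.List;

    class ResolveState {
        // 0..n local files (maven local, old gradle caches, the current
        // filestore) resolved for effectively the same artifact id.
        List<File> localCandidates;

        // Meta-data captured the last time we hit this url, if any.
        CachedExternalResource cachedResource;
    }

    class CachedExternalResource {
        File fileStoreEntry;   // where the bytes live in the filestore
        long contentLength;    // -1 if unknown
        Date lastModified;
        String etag;
        String sha1;           // if we ever obtained one
    }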
> 
> The fetch process looks like this:
> 
> * If there are any locally available resource candidates, fetch the remote 
> sha1 for the resource if it's available.
> * If any of the locally available resource candidates have the same checksum, 
> use that instead of downloading the resource (at the cost of not obtaining 
> metadata such as last modified, etag etc).
> * If not, or if there was no remote checksum available:
> * If we have a cached version of the resource, compare the cached metadata 
> with the real metadata via a HEAD request (implies that there was no remote 
> checksum in practice).
> ** If the metadata is unchanged (compare last modified date and content 
> length for equality), use the cached version (including metadata).
> ** If the metadata is changed, issue a GET to download the resource (then 
> cache the resource of course)
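
If I follow correctly, that's this shape in code (invented helpers standing in for the real plumbing, reusing CachedExternalResource from the sketch above):

    import java.io.File;
    import java.util.List;

    abstract class ChecksumFirstFetch {
        abstract String getTextOrNull(String url);   // HTTP GET, null on 404
        abstract String sha1Of(File file);
        abstract boolean metaDataChanged(String url, CachedExternalResource cached); // HEAD
        abstract File download(String url);          // HTTP GET, then cache

        File fetch(String url, List<File> candidates, CachedExternalResource cached) {
            if (!candidates.isEmpty()) {
                String remoteSha1 = getTextOrNull(url + ".sha1");  // GET the checksum
                if (remoteSha1 != null) {
                    for (File candidate : candidates) {
                        if (remoteSha1.equals(sha1Of(candidate))) {
                            return candidate;  // same bytes, but no fresh meta-data
                        }
                    }
                }
            }
            if (cached != null && !metaDataChanged(url, cached)) {
                return cached.fileStoreEntry;  // unchanged per HEAD: reuse cached copy
            }
            return download(url);  // changed or unknown: GET and re-cache
        }
    }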
> 
> I think this is the practical thing to do, but probably not theoretically 
> correct.
> 
> The issue is that by using the checksum check to determine if something has 
> changed or not, we lose any cached metadata about the resource. If we find 
> something on the filesystem with the same checksum, all we can really assume 
> is that that file has the same binary content. We cannot assume that it came 
> from the same URL, so strictly speaking we should invalidate any cached 
> metadata we had for that URL. However, since the only metadata items that we 
> care about are content 
> length, last modified and etag, if the checksum hasn't changed we could 
> probably assume that these values haven't changed either.
> 
> Furthermore, it probably doesn't matter because if there are remote checksums 
> for a resource available then we aren't really going to use the metadata for 
> anything.
> 
> Further furthermore, our current strategy is optimised for the case where 
> checksums are available which is considered best practice. If we flipped it 
> around and compared metadata first…
> 
> Pros:
> * If the item is unchanged, we only have one HEAD request as opposed to the 
> GET on the checksum (faster)
> * We maintain cached metadata “integrity”
> 
> Cons:
> * If the item has changed, we have one HEAD for the metadata (to determine it 
> was changed) then another GET for the sha1 (to look for locally available 
> resources)
> 
> Keep in mind, the con there is the rare case. This means that the external 
> resource has changed since the last time we saw it, but something else (e.g. 
> maven, or an older gradle version) has downloaded it in the meantime.
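
And metadata first would be, in the same invented vocabulary:

    import java.io.File;
    import java.util.List;

    abstract class MetaDataFirstFetch {
        abstract boolean metaDataChanged(String url, CachedExternalResource cached); // HEAD
        abstract String getTextOrNull(String url);   // HTTP GET, null on 404
        abstract String sha1Of(File file);
        abstract File download(String url);          // HTTP GET, then cache

        File fetch(String url, List<File> candidates, CachedExternalResource cached) {
            // One HEAD up front; if nothing changed we're done, meta-data intact.
            if (cached != null && !metaDataChanged(url, cached)) {
                return cached.fileStoreEntry;
            }
            // Changed or never seen: check local candidates before downloading.
            if (!candidates.isEmpty()) {
                String remoteSha1 = getTextOrNull(url + ".sha1");
                if (remoteSha1 != null) {
                    for (File candidate : candidates) {
                        if (remoteSha1.equals(sha1Of(candidate))) {
                            return candidate;
                        }
                    }
                }
            }
            return download(url);
        }
    }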
> 
> Under this (metadata first) strategy, the requests for a 
> seen-before-but-changed resource would look like this:
> 
> * HEAD to resource (get metadata) - determine changed
> * GET to checksum - most likely outcome is that we don't find a local version 
> of this
> * GET to resource
> 
> Under the current (checksum first) strategy it looks like this:
> 
> * GET to checksum - no local version found with checksum
> * GET to resource
> 
> Under this (metadata first) strategy, the requests for a 
> seen-before-but-UNchanged resource would look like this:
> 
> * HEAD to resource (get metadata) - determine unchanged
> 
> Under the current (checksum first) strategy it looks like this:
> 
> * GET to checksum - local version found with checksum (can't guarantee it 
> came from the same URL)
> 
> 
> Still following? :)
> 
> For me this comes down to:
> 
> * Is there a noticeable benefit to one HEAD request over one GET (for a sha1 
> text file)? If not, we don't change anything. If so,
> * Do we optimise for the case where the resource is unchanged?
> 
> 
> There's another interesting option. Some servers send an “X-checksum-SHA1” 
> header (e.g. Artifactory). In this case, we could use this when performing 
> the initial HEAD and get the best of both worlds. Other servers advertise 
> that their etags are SHA1s (e.g. Nexus). We could use this metadata, and keep 
> the extra sha1 request as a fallback.
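
That could slot in neatly; something like this (sketch only: the header name is as you describe it, and the "{SHA1{...}}" etag format is my assumption about the Nexus convention):

    import java.net.HttpURLConnection;

    class HeaderSha1 {
        // Pull a SHA1 out of the response headers where the server offers one;
        // null means fall back to the separate .sha1 request.
        static String sha1FromHeaders(HttpURLConnection response) {
            // Artifactory-style explicit checksum header (lookup is case-insensitive).
            String explicit = response.getHeaderField("X-checksum-SHA1");
            if (explicit != null) {
                return explicit;
            }
            // Nexus-style etag carrying the SHA1, e.g. "{SHA1{1234abcd...}}".
            String etag = response.getHeaderField("ETag");
            if (etag != null) {
                etag = etag.replace("\"", "");
                if (etag.startsWith("{SHA1{") && etag.endsWith("}}")) {
                    return etag.substring("{SHA1{".length(), etag.length() - 2);
                }
            }
            return null;
        }
    }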
> 
> --
> Luke Daley
> Principal Engineer, Gradleware
> http://gradleware.com
> 
> 
> -- 
> Darrell (Daz) DeBoer
> Principal Engineer, Gradleware 
> http://www.gradleware.com
> 


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com
