On 23/03/2012, at 3:48 AM, Daz DeBoer wrote:

> On 22 March 2012 05:02, Luke Daley <[email protected]> wrote:
> 
> On 22/03/2012, at 7:15 AM, Daz DeBoer wrote:
>>  I guess one question is whether the lastModified date can actually be 
>> considered some sort of meta-data about the artifact itself, or if it's 
>> actually more like meta-data about the artifact retrieval. I kind of think 
>> the lastModified date should be only used when checking if a particular URL 
>> retrieval is up-to-date: and that this value should not leak into the model 
>> further.
> 
> I've got no problem conceptualising it as meta data about the resolution. Say 
> if we provided some kind of report on the contents of the cache relevant to 
> the particular build you are working on. I can see providing the url and last 
> modified of the artifacts when they were resolved in a particular repository 
> being useful information.
> 
>> At the moment, if we get the same artifact from 2 different URLs in the same 
>> repository, it looks like we will overwrite the first retrieval 
>> (lastModified + url) when we cache the second. This doesn't feel great to me.
> 
> For the “by-repository” cache, yes. But that in itself is interesting 
> information (i.e. given the same repository and module vectors we got 
> something at a different URL). It becomes more interesting if the repository 
> is actually serving back redirects too. We could potentially then do some 
> short circuiting here by using this info (need to think about that more).
> 
> We could be missing a concept here;

We are. There're two things here:

* Artifacts. This is a binary object that is published as part of a module.
* Resources. This is a binary object at a given location.

An artefact has:
* An artifact id.
* A (file) resource that represents its location in the cache.
* A resource that represents its origin.

A resource has:
* A url that identifies the location of the resource
* Some binary content
* A last-modified-time.

These aren't the same things, and there isn't necessarily a one-to-one 
relationship between these:
* A resource does not necessarily represent an artefact. e.g., apply from: 
'some-url.gradle', copy { from 'some-url' }, etc.
* There can be multiple candidate resources where an artefact can be found.
* An artefact may not have any resource as its origin.
* A resource can represent multiple different artefacts.

So, I think we want 2 separate caches here:
* Given the URL of a resource, give me (last-modified, cached-resource) of the 
resource.
* Given an artefact id, give me the (origin-resource, cached-resource) of the 
artefact.



> the separation of artifact meta-data (artifactId/file/sha1) from resolution 
> meta-data (url/lastModified/timestamp). But looking deeper I don't think this 
> was exacerbated by your recent changes at all, and I don't think this it's 
> too big a deal.
> 
> Also looking deeper I see a potential issue when the artifact is not 
> downloaded but we are using a previously downloaded artifact based on SHA1. 
> In that case we will be (incorrectly) using the last modified date of the the 
> cached file as the URL last modified date. This is because CachedHttpResource 
> uses the file modification date as 'lastModified'. The only ways I can think 
> to get around this are to 1) issue an extra HTTP HEAD request to get the 
> correct lastModified date, 2) Assume a zero lastModified date for these 
> artifacts, or 3) Store the last modified date in 'external' artifact caches 
> in a way that it is available cross-version. As it stands, I think there are 
> scenarios where this will produce incorrect artifact resolution.

4) Don't associate a last-modified-date with an artefact. It belongs with the 
origin resource, which in the case of something we find locally, is the local 
resource, not the remote resource.

So, given the caches above, if I am resolving an artefact:
1. Look in the artefact cache for an entry with the given artefact id. If the 
entry has not expired, use the corresponding cached-resource.
2. Resolver calculates a candidate origin url (or set of candidate origin urls).
3. Look for local candidates for the given artefact id. If any found, go and 
fetch the remote sha1. If any candidate matches the checksum, add an entry to 
the artefact cache with origin-resource and cached-resource both pointing at 
the local file (i.e. with url == url of local file, last-modified-time == last 
modified time of local file). Return the cached-resource.
3. Look in the resource cache for an entry with the given origin url:
    * If found, do a GET of the origin url, with if-modified-since set to the 
last modified time from the cache entry.
        * If we get an unmodified response, we can use the cached-resource from 
the cache entry.
        * If we get a modified response, download the content into the 
filestore, and add an entry to the resource cache for the origin-url, with 
last-modified-time from the http response, and cached-resource pointing at the 
local file in the filestore. 
    * Otherwise, do a GET of the origin url. Download the content into the 
filestore. Add an entry to the resource cache, as above.
4. Add an entry to the artefact cache for the given artefact id, with 
origin-resource pointing at the origin url, and cached-resource pointing to the 
content in the file store. Return the cached-resource.


> 
> This has made me think of another thing: should --refresh-dependencies bypass 
> the url resolution cache? We don't bypass the SHA1 checking when using 
> '--refresh-dependencies', but that process is not temporally sensitive: it 
> can only be screwed up if the SHA1 is published incorrectly. The caching of 
> URL lastModified timestamps could be problematic if a new artifact was 
> published on a URL, and the user wanted to switch back to an older one. If we 
> think of '--refresh-dependencies' a bit like force-reload in a browser, then 
> we probably want to bypass the url resolution cache with this switch.

It's not like a force-reload, I think, otherwise we should just download 
everything again, regardless of whether checksums are available. The semantics 
of --refresh-dependencies is to verify that the cached artefacts are 
up-to-date. This should happen the same way that we check whether a changing 
artefact has changed - however it is that this happens. So, all 
--refresh-dependencies should do is skip step 1. above.

The question is, however, whether if-modified-since is the way to go here. 
Perhaps we should do a HEAD request and compare the last modified time and 
content length ourselves here. I think we should do this.


> Implementation-wise this could be tricky, however.
>  
> I guess my opinion is that the shared abstraction is nice, and coherent in my 
> opinion, but we are storing data that we are not using right now.
> 
> So our options are:
> 
> #1 - leave it the way it is and live with the cost of storing this extra 
> metadata (i.e. cache file size and extra serialisation cost)
> #2 - optimise to only store what is strictly needed and introduce more 
> concepts and types
> 
> My vote is for 1, but given the criticality of this section of code I don't 
> feel comfortable making that decision on my own.
> 
> My vote is for 1 as well. I'm not concerned about cache size or serialization 
> cost.

My vote is #3: separate out the artefact and resource caches, and use a HEAD 
request instead of a GET if-modified-since request.


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com

Reply via email to