I haven't analysed this yet. I will in the next few days.

As a first step, are we leaving the change in or taking it out for 1.0?

On 22/03/2012, at 7:53 PM, Adam Murdoch <[email protected]> wrote:

> 
> On 23/03/2012, at 3:48 AM, Daz DeBoer wrote:
> 
>> On 22 March 2012 05:02, Luke Daley <[email protected]> wrote:
>> 
>> On 22/03/2012, at 7:15 AM, Daz DeBoer wrote:
>>>  I guess one question is whether the lastModified date can actually be 
>>> considered some sort of meta-data about the artifact itself, or if it's 
>>> actually more like meta-data about the artifact retrieval. I kind of think 
>>> the lastModified date should be only used when checking if a particular URL 
>>> retrieval is up-to-date: and that this value should not leak into the model 
>>> further.
>> 
>> I've got no problem conceptualising it as meta data about the resolution. 
>> Say if we provided some kind of report on the contents of the cache relevant 
>> to the particular build you are working on. I can see providing the url and 
>> last modified of the artifacts when they were resolved in a particular 
>> repository being useful information.
>> 
>>> At the moment, if we get the same artifact from 2 different URLs in the 
>>> same repository, it looks like we will overwrite the first retrieval 
>>> (lastModified + url) when we cache the second. This doesn't feel great to 
>>> me.
>> 
>> For the “by-repository” cache, yes. But that in itself is interesting 
>> information (i.e. given the same repository and module vectors we got 
>> something at a different URL). It becomes more interesting if the repository 
>> is actually serving back redirects too. We could potentially then do some 
>> short circuiting here by using this info (need to think about that more).
>> 
>> We could be missing a concept here;
> 
> We are. There're two things here:
> 
> * Artifacts. This is a binary object that is published as part of a module.
> * Resources. This is a binary object at a given location.
> 
> An artefact has:
> * An artifact id.
> * A (file) resource that represents its location in the cache.
> * A resource that represents its origin.
> 
> A resource has:
> * A url that identifies the location of the resource
> * Some binary content
> * A last-modified-time.
> 
> These aren't the same things, and there isn't necessarily a one-to-one 
> relationship between these:
> * A resource does not necessarily represent an artefact. e.g., apply from: 
> 'some-url.gradle', copy { from 'some-url' }, etc.
> * There can be multiple candidate resources where an artefact can be found.
> * An artefact may not have any resource as its origin.
> * A resource can represent multiple different artefacts.
> 
> So, I think we want 2 separate caches here:
> * Given the URL of a resource, give me (last-modified, cached-resource) of 
> the resource.
> * Given an artefact id, give me the (origin-resource, cached-resource) of the 
> artefact.
> 
> 
> 
>> the separation of artifact meta-data (artifactId/file/sha1) from resolution 
>> meta-data (url/lastModified/timestamp). But looking deeper I don't think 
>> this was exacerbated by your recent changes at all, and I don't think this 
>> it's too big a deal.
>> 
>> Also looking deeper I see a potential issue when the artifact is not 
>> downloaded but we are using a previously downloaded artifact based on SHA1. 
>> In that case we will be (incorrectly) using the last modified date of the 
>> the cached file as the URL last modified date. This is because 
>> CachedHttpResource uses the file modification date as 'lastModified'. The 
>> only ways I can think to get around this are to 1) issue an extra HTTP HEAD 
>> request to get the correct lastModified date, 2) Assume a zero lastModified 
>> date for these artifacts, or 3) Store the last modified date in 'external' 
>> artifact caches in a way that it is available cross-version. As it stands, I 
>> think there are scenarios where this will produce incorrect artifact 
>> resolution.
> 
> 4) Don't associate a last-modified-date with an artefact. It belongs with the 
> origin resource, which in the case of something we find locally, is the local 
> resource, not the remote resource.
> 
> So, given the caches above, if I am resolving an artefact:
> 1. Look in the artefact cache for an entry with the given artefact id. If the 
> entry has not expired, use the corresponding cached-resource.
> 2. Resolver calculates a candidate origin url (or set of candidate origin 
> urls).
> 3. Look for local candidates for the given artefact id. If any found, go and 
> fetch the remote sha1. If any candidate matches the checksum, add an entry to 
> the artefact cache with origin-resource and cached-resource both pointing at 
> the local file (i.e. with url == url of local file, last-modified-time == 
> last modified time of local file). Return the cached-resource.
> 3. Look in the resource cache for an entry with the given origin url:
>     * If found, do a GET of the origin url, with if-modified-since set to the 
> last modified time from the cache entry.
>         * If we get an unmodified response, we can use the cached-resource 
> from the cache entry.
>         * If we get a modified response, download the content into the 
> filestore, and add an entry to the resource cache for the origin-url, with 
> last-modified-time from the http response, and cached-resource pointing at 
> the local file in the filestore. 
>     * Otherwise, do a GET of the origin url. Download the content into the 
> filestore. Add an entry to the resource cache, as above.
> 4. Add an entry to the artefact cache for the given artefact id, with 
> origin-resource pointing at the origin url, and cached-resource pointing to 
> the content in the file store. Return the cached-resource.
> 
> 
>> 
>> This has made me think of another thing: should --refresh-dependencies 
>> bypass the url resolution cache? We don't bypass the SHA1 checking when 
>> using '--refresh-dependencies', but that process is not temporally 
>> sensitive: it can only be screwed up if the SHA1 is published incorrectly. 
>> The caching of URL lastModified timestamps could be problematic if a new 
>> artifact was published on a URL, and the user wanted to switch back to an 
>> older one. If we think of '--refresh-dependencies' a bit like force-reload 
>> in a browser, then we probably want to bypass the url resolution cache with 
>> this switch.
> 
> It's not like a force-reload, I think, otherwise we should just download 
> everything again, regardless of whether checksums are available. The 
> semantics of --refresh-dependencies is to verify that the cached artefacts 
> are up-to-date. This should happen the same way that we check whether a 
> changing artefact has changed - however it is that this happens. So, all 
> --refresh-dependencies should do is skip step 1. above.
> 
> The question is, however, whether if-modified-since is the way to go here. 
> Perhaps we should do a HEAD request and compare the last modified time and 
> content length ourselves here. I think we should do this.
> 
> 
>> Implementation-wise this could be tricky, however.
>>  
>> I guess my opinion is that the shared abstraction is nice, and coherent in 
>> my opinion, but we are storing data that we are not using right now.
>> 
>> So our options are:
>> 
>> #1 - leave it the way it is and live with the cost of storing this extra 
>> metadata (i.e. cache file size and extra serialisation cost)
>> #2 - optimise to only store what is strictly needed and introduce more 
>> concepts and types
>> 
>> My vote is for 1, but given the criticality of this section of code I don't 
>> feel comfortable making that decision on my own.
>> 
>> My vote is for 1 as well. I'm not concerned about cache size or 
>> serialization cost.
> 
> My vote is #3: separate out the artefact and resource caches, and use a HEAD 
> request instead of a GET if-modified-since request.
> 
> 
> --
> Adam Murdoch
> Gradle Co-founder
> http://www.gradle.org
> VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
> http://www.gradleware.com
> 

Reply via email to