I haven't analysed this yet. I will in the next few days. As a first step, are we leaving the change in or taking it out for 1.0?
On 22/03/2012, at 7:53 PM, Adam Murdoch <[email protected]> wrote: > > On 23/03/2012, at 3:48 AM, Daz DeBoer wrote: > >> On 22 March 2012 05:02, Luke Daley <[email protected]> wrote: >> >> On 22/03/2012, at 7:15 AM, Daz DeBoer wrote: >>> I guess one question is whether the lastModified date can actually be >>> considered some sort of meta-data about the artifact itself, or if it's >>> actually more like meta-data about the artifact retrieval. I kind of think >>> the lastModified date should be only used when checking if a particular URL >>> retrieval is up-to-date: and that this value should not leak into the model >>> further. >> >> I've got no problem conceptualising it as meta data about the resolution. >> Say if we provided some kind of report on the contents of the cache relevant >> to the particular build you are working on. I can see providing the url and >> last modified of the artifacts when they were resolved in a particular >> repository being useful information. >> >>> At the moment, if we get the same artifact from 2 different URLs in the >>> same repository, it looks like we will overwrite the first retrieval >>> (lastModified + url) when we cache the second. This doesn't feel great to >>> me. >> >> For the “by-repository” cache, yes. But that in itself is interesting >> information (i.e. given the same repository and module vectors we got >> something at a different URL). It becomes more interesting if the repository >> is actually serving back redirects too. We could potentially then do some >> short circuiting here by using this info (need to think about that more). >> >> We could be missing a concept here; > > We are. There're two things here: > > * Artifacts. This is a binary object that is published as part of a module. > * Resources. This is a binary object at a given location. > > An artefact has: > * An artifact id. > * A (file) resource that represents its location in the cache. > * A resource that represents its origin. > > A resource has: > * A url that identifies the location of the resource > * Some binary content > * A last-modified-time. > > These aren't the same things, and there isn't necessarily a one-to-one > relationship between these: > * A resource does not necessarily represent an artefact. e.g., apply from: > 'some-url.gradle', copy { from 'some-url' }, etc. > * There can be multiple candidate resources where an artefact can be found. > * An artefact may not have any resource as its origin. > * A resource can represent multiple different artefacts. > > So, I think we want 2 separate caches here: > * Given the URL of a resource, give me (last-modified, cached-resource) of > the resource. > * Given an artefact id, give me the (origin-resource, cached-resource) of the > artefact. > > > >> the separation of artifact meta-data (artifactId/file/sha1) from resolution >> meta-data (url/lastModified/timestamp). But looking deeper I don't think >> this was exacerbated by your recent changes at all, and I don't think this >> it's too big a deal. >> >> Also looking deeper I see a potential issue when the artifact is not >> downloaded but we are using a previously downloaded artifact based on SHA1. >> In that case we will be (incorrectly) using the last modified date of the >> the cached file as the URL last modified date. This is because >> CachedHttpResource uses the file modification date as 'lastModified'. The >> only ways I can think to get around this are to 1) issue an extra HTTP HEAD >> request to get the correct lastModified date, 2) Assume a zero lastModified >> date for these artifacts, or 3) Store the last modified date in 'external' >> artifact caches in a way that it is available cross-version. As it stands, I >> think there are scenarios where this will produce incorrect artifact >> resolution. > > 4) Don't associate a last-modified-date with an artefact. It belongs with the > origin resource, which in the case of something we find locally, is the local > resource, not the remote resource. > > So, given the caches above, if I am resolving an artefact: > 1. Look in the artefact cache for an entry with the given artefact id. If the > entry has not expired, use the corresponding cached-resource. > 2. Resolver calculates a candidate origin url (or set of candidate origin > urls). > 3. Look for local candidates for the given artefact id. If any found, go and > fetch the remote sha1. If any candidate matches the checksum, add an entry to > the artefact cache with origin-resource and cached-resource both pointing at > the local file (i.e. with url == url of local file, last-modified-time == > last modified time of local file). Return the cached-resource. > 3. Look in the resource cache for an entry with the given origin url: > * If found, do a GET of the origin url, with if-modified-since set to the > last modified time from the cache entry. > * If we get an unmodified response, we can use the cached-resource > from the cache entry. > * If we get a modified response, download the content into the > filestore, and add an entry to the resource cache for the origin-url, with > last-modified-time from the http response, and cached-resource pointing at > the local file in the filestore. > * Otherwise, do a GET of the origin url. Download the content into the > filestore. Add an entry to the resource cache, as above. > 4. Add an entry to the artefact cache for the given artefact id, with > origin-resource pointing at the origin url, and cached-resource pointing to > the content in the file store. Return the cached-resource. > > >> >> This has made me think of another thing: should --refresh-dependencies >> bypass the url resolution cache? We don't bypass the SHA1 checking when >> using '--refresh-dependencies', but that process is not temporally >> sensitive: it can only be screwed up if the SHA1 is published incorrectly. >> The caching of URL lastModified timestamps could be problematic if a new >> artifact was published on a URL, and the user wanted to switch back to an >> older one. If we think of '--refresh-dependencies' a bit like force-reload >> in a browser, then we probably want to bypass the url resolution cache with >> this switch. > > It's not like a force-reload, I think, otherwise we should just download > everything again, regardless of whether checksums are available. The > semantics of --refresh-dependencies is to verify that the cached artefacts > are up-to-date. This should happen the same way that we check whether a > changing artefact has changed - however it is that this happens. So, all > --refresh-dependencies should do is skip step 1. above. > > The question is, however, whether if-modified-since is the way to go here. > Perhaps we should do a HEAD request and compare the last modified time and > content length ourselves here. I think we should do this. > > >> Implementation-wise this could be tricky, however. >> >> I guess my opinion is that the shared abstraction is nice, and coherent in >> my opinion, but we are storing data that we are not using right now. >> >> So our options are: >> >> #1 - leave it the way it is and live with the cost of storing this extra >> metadata (i.e. cache file size and extra serialisation cost) >> #2 - optimise to only store what is strictly needed and introduce more >> concepts and types >> >> My vote is for 1, but given the criticality of this section of code I don't >> feel comfortable making that decision on my own. >> >> My vote is for 1 as well. I'm not concerned about cache size or >> serialization cost. > > My vote is #3: separate out the artefact and resource caches, and use a HEAD > request instead of a GET if-modified-since request. > > > -- > Adam Murdoch > Gradle Co-founder > http://www.gradle.org > VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting > http://www.gradleware.com >
