On 22/03/2012, at 7:53 PM, Adam Murdoch wrote:
>
> On 23/03/2012, at 3:48 AM, Daz DeBoer wrote:
>
>> On 22 March 2012 05:02, Luke Daley <[email protected]> wrote:
>>
>> On 22/03/2012, at 7:15 AM, Daz DeBoer wrote:
>>> I guess one question is whether the lastModified date can actually be
>>> considered some sort of meta-data about the artifact itself, or if it's
>>> actually more like meta-data about the artifact retrieval. I kind of think
>>> the lastModified date should be only used when checking if a particular URL
>>> retrieval is up-to-date: and that this value should not leak into the model
>>> further.
>>
>> I've got no problem conceptualising it as meta data about the resolution.
>> Say if we provided some kind of report on the contents of the cache relevant
>> to the particular build you are working on. I can see providing the url and
>> last modified of the artifacts when they were resolved in a particular
>> repository being useful information.
>>
>>> At the moment, if we get the same artifact from 2 different URLs in the
>>> same repository, it looks like we will overwrite the first retrieval
>>> (lastModified + url) when we cache the second. This doesn't feel great to
>>> me.
>>
>> For the “by-repository” cache, yes. But that in itself is interesting
>> information (i.e. given the same repository and module vectors we got
>> something at a different URL). It becomes more interesting if the repository
>> is actually serving back redirects too. We could potentially then do some
>> short circuiting here by using this info (need to think about that more).
>>
>> We could be missing a concept here;
>
> We are. There're two things here:
>
> * Artifacts. This is a binary object that is published as part of a module.
I'm not clear on the term “module” here. Is it as simple as the module being
"commons-lang:commons-lang" and the the artifact being the particular instance
of this module that is "commons-lang:commons-lang:2.4" ?
Or is it that an “artifact” is just a kind of more abstract description,
usually group/name/version, that ultimately resolves to some binary resource?
> * Resources. This is a binary object at a given location.
>
> An artefact has:
> * An artifact id.
> * A (file) resource that represents its location in the cache.
> * A resource that represents its origin.
Given just an artifact, I don't think we can talk about its origin. At least at
the moment we don't as we ask Gradle to effectively “select” the
origin/resource for an artifact. It's not really intrinsic to the artifact
itself.
Or should I just add the implication that you are talking about
artifact-at-repository here?
> A resource has:
> * A url that identifies the location of the resource
> * Some binary content
> * A last-modified-time.
>
> These aren't the same things, and there isn't necessarily a one-to-one
> relationship between these:
> * A resource does not necessarily represent an artefact. e.g., apply from:
> 'some-url.gradle', copy { from 'some-url' }, etc.
> * There can be multiple candidate resources where an artefact can be found.
> * An artefact may not have any resource as its origin.
> * A resource can represent multiple different artefacts.
>
> So, I think we want 2 separate caches here:
> * Given the URL of a resource, give me (last-modified, cached-resource) of
> the resource.
> * Given an artefact id, give me the (origin-resource, cached-resource) of the
> artefact.
Makes sense to me.
>> the separation of artifact meta-data (artifactId/file/sha1) from resolution
>> meta-data (url/lastModified/timestamp). But looking deeper I don't think
>> this was exacerbated by your recent changes at all, and I don't think this
>> it's too big a deal.
>>
>> Also looking deeper I see a potential issue when the artifact is not
>> downloaded but we are using a previously downloaded artifact based on SHA1.
>> In that case we will be (incorrectly) using the last modified date of the
>> the cached file as the URL last modified date. This is because
>> CachedHttpResource uses the file modification date as 'lastModified'. The
>> only ways I can think to get around this are to 1) issue an extra HTTP HEAD
>> request to get the correct lastModified date, 2) Assume a zero lastModified
>> date for these artifacts, or 3) Store the last modified date in 'external'
>> artifact caches in a way that it is available cross-version. As it stands, I
>> think there are scenarios where this will produce incorrect artifact
>> resolution.
>
> 4) Don't associate a last-modified-date with an artefact. It belongs with the
> origin resource, which in the case of something we find locally, is the local
> resource, not the remote resource.
>
> So, given the caches above, if I am resolving an artefact:
> 1. Look in the artefact cache for an entry with the given artefact id. If the
> entry has not expired, use the corresponding cached-resource.
> 2. Resolver calculates a candidate origin url (or set of candidate origin
> urls).
> 3. Look for local candidates for the given artefact id. If any found, go and
> fetch the remote sha1. If any candidate matches the checksum, add an entry to
> the artefact cache with origin-resource and cached-resource both pointing at
> the local file (i.e. with url == url of local file, last-modified-time ==
> last modified time of local file). Return the cached-resource.
> 3. Look in the resource cache for an entry with the given origin url:
> * If found, do a GET of the origin url, with if-modified-since set to the
> last modified time from the cache entry.
> * If we get an unmodified response, we can use the cached-resource
> from the cache entry.
> * If we get a modified response, download the content into the
> filestore, and add an entry to the resource cache for the origin-url, with
> last-modified-time from the http response, and cached-resource pointing at
> the local file in the filestore.
> * Otherwise, do a GET of the origin url. Download the content into the
> filestore. Add an entry to the resource cache, as above.
> 4. Add an entry to the artefact cache for the given artefact id, with
> origin-resource pointing at the origin url, and cached-resource pointing to
> the content in the file store. Return the cached-resource.
>
>
>> This has made me think of another thing: should --refresh-dependencies
>> bypass the url resolution cache? We don't bypass the SHA1 checking when
>> using '--refresh-dependencies', but that process is not temporally
>> sensitive: it can only be screwed up if the SHA1 is published incorrectly.
>> The caching of URL lastModified timestamps could be problematic if a new
>> artifact was published on a URL, and the user wanted to switch back to an
>> older one. If we think of '--refresh-dependencies' a bit like force-reload
>> in a browser, then we probably want to bypass the url resolution cache with
>> this switch.
>
> It's not like a force-reload, I think, otherwise we should just download
> everything again, regardless of whether checksums are available. The
> semantics of --refresh-dependencies is to verify that the cached artefacts
> are up-to-date. This should happen the same way that we check whether a
> changing artefact has changed - however it is that this happens. So, all
> --refresh-dependencies should do is skip step 1. above.
>
> The question is, however, whether if-modified-since is the way to go here.
> Perhaps we should do a HEAD request and compare the last modified time and
> content length ourselves here. I think we should do this.
>
>
>> Implementation-wise this could be tricky, however.
>>
>> I guess my opinion is that the shared abstraction is nice, and coherent in
>> my opinion, but we are storing data that we are not using right now.
>>
>> So our options are:
>>
>> #1 - leave it the way it is and live with the cost of storing this extra
>> metadata (i.e. cache file size and extra serialisation cost)
>> #2 - optimise to only store what is strictly needed and introduce more
>> concepts and types
>>
>> My vote is for 1, but given the criticality of this section of code I don't
>> feel comfortable making that decision on my own.
>>
>> My vote is for 1 as well. I'm not concerned about cache size or
>> serialization cost.
>
> My vote is #3: separate out the artefact and resource caches, and use a HEAD
> request instead of a GET if-modified-since request.
My gut feel is that we should sort this out now while it's fresh and we are
discussing it. It doesn't seem like a lot of work to me. Though of course, I
guess it can wait.
--
Luke Daley
Principal Engineer, Gradleware
http://gradleware.com