Hi,

Currently, we use Ivy's ChainResolver to do 2 things:
* Resolve a dependency descriptor (org, module, revision, config constraints) 
to a module descriptor (more or less an in-memory ivy.xml).
* Resolve an artifact descriptor (org, module, revision, name, ext, etc) to a 
file.

To resolve a dependency descriptor, ChainResolver iterates over its resolvers, 
asking each resolver to resolve the dependency descriptor. It does not stop when 
one of the resolvers happens to find the dependency; it always continues to the 
end of the chain. Typically, a resolver will look first in its cache, and if the 
dependency is not present there, will hit the target repository (e.g. issue an 
HTTP GET request or whatever).
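
Roughly, the current behaviour looks like this. The types here are invented 
stand-ins, not the real Ivy classes, and which of the results the chain ends up 
returning isn't the point - the point is that every resolver gets asked:

    import java.util.List;

    // Invented stand-ins for the Ivy types - not the real API.
    class DependencyDescriptor { String org, module, revision; }
    class ModuleDescriptor { /* more or less an in-memory ivy.xml */ }

    interface DependencyResolver {
        // Looks in this resolver's cache first, and hits its repository on a miss.
        ModuleDescriptor getDependency(DependencyDescriptor dd);
    }

    class ChainResolverSketch implements DependencyResolver {
        private final List<DependencyResolver> chain;

        ChainResolverSketch(List<DependencyResolver> chain) {
            this.chain = chain;
        }

        public ModuleDescriptor getDependency(DependencyDescriptor dd) {
            ModuleDescriptor result = null;
            for (DependencyResolver resolver : chain) {
                // Every resolver in the chain is asked, even after one has already
                // found the dependency, so every cache miss becomes a repository hit.
                ModuleDescriptor found = resolver.getDependency(dd);
                if (result == null) {
                    result = found;   // which result wins is a detail I'm glossing over
                }
            }
            return result;
        }
    }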

This used to work fine performance-wise when we were using the default Ivy 
cache, as each resolver shared the same cached meta-data: once any resolver 
resolved the dependency, every resolver would find the dependency in the cache 
from then on. Now that we cache the meta-data per resolver, only those 
resolvers that actually resolved the dependency will later find it in their 
cache.

So, for example, say we have 2 remote repositories, maven central and 
repo.gradle.org, and a module that is published only to repo.gradle.org and not 
to maven central, say gradle-tooling-api. Every time we try to resolve 
gradle-tooling-api, the ChainResolver asks the maven central resolver to resolve 
it. The maven central resolver doesn't find it in its cache, and hits maven 
central to look for it. The ChainResolver then asks the repo.gradle.org 
resolver, which finds it in its cache and returns it. The net result is that we 
hit maven central at least once per build looking for a module that we've 
already found in repo.gradle.org.

ChainResolver does the same thing when resolving an artifact descriptor: it asks 
each resolver in turn. Each resolver looks in its cache, and hits the repository 
if the artifact is not found there. This suffers from the same problem now that 
we cache artifacts per resolver. In particular, it makes the IDE tasks really, 
really slow, as we ask each resolver in turn whether it has the -sources.jar and 
the -src.jar and so on.

I'd like to replace this with the following:

When resolving a dependency descriptor:
* We look for a matching dependency in the cache for each resolver. We don't 
invoke the resolvers at this stage. If a matching cached value is found that 
has not expired, we use the cached value, and stop looking.
* Otherwise, we attempt to resolve the dependency using each resolver in turn. 
We stop on the first resolver that can resolve the dependency.
* If this fails, and an expired entry was found in the cache, we use that 
instead.
* We remember which resolver found the module, for downloading the artifacts 
later.

When resolving an artifact, we delegate directly to the resolver where the 
artifact's module was found.
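
Here's a rough sketch of how that might hang together. The names and types are 
invented, there's a single flat TTL, and there's no error handling - it's just 
to make the flow above concrete:

    import java.io.File;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class ProposedChainSketch {
        // Invented stand-ins, not the real Ivy/Gradle types.
        interface Resolver {
            ModuleMetaData getDependency(String requested);  // may hit the remote repository
            File downloadArtifact(String artifactId);        // may hit the remote repository
        }
        static class ModuleMetaData {}
        static class CacheEntry {
            ModuleMetaData module;
            long expiresAt;
            boolean hasExpired() { return System.currentTimeMillis() > expiresAt; }
        }

        private final List<Resolver> resolvers;
        private final Map<Resolver, Map<String, CacheEntry>> cachePerResolver =
                new HashMap<Resolver, Map<String, CacheEntry>>();
        private final Map<String, Resolver> sourceOfModule = new HashMap<String, Resolver>();
        private final long timeToLiveMillis;

        ProposedChainSketch(List<Resolver> resolvers, long timeToLiveMillis) {
            this.resolvers = resolvers;
            this.timeToLiveMillis = timeToLiveMillis;
            for (Resolver resolver : resolvers) {
                cachePerResolver.put(resolver, new HashMap<String, CacheEntry>());
            }
        }

        ModuleMetaData resolve(String requested) {
            // 1. Cache-only pass: look in each resolver's cache without invoking the resolver.
            CacheEntry expired = null;
            Resolver expiredSource = null;
            for (Resolver resolver : resolvers) {
                CacheEntry entry = cachePerResolver.get(resolver).get(requested);
                if (entry == null) {
                    continue;
                }
                if (!entry.hasExpired()) {
                    sourceOfModule.put(requested, resolver);
                    return entry.module;             // fresh cached value - stop looking
                }
                expired = entry;                      // keep as a last resort
                expiredSource = resolver;
            }

            // 2. Ask each resolver in turn, stopping at the first one that can resolve it.
            for (Resolver resolver : resolvers) {
                ModuleMetaData module = resolver.getDependency(requested);
                if (module != null) {
                    CacheEntry entry = new CacheEntry();
                    entry.module = module;
                    entry.expiresAt = System.currentTimeMillis() + timeToLiveMillis;
                    cachePerResolver.get(resolver).put(requested, entry);
                    sourceOfModule.put(requested, resolver);  // remembered for artifact downloads
                    return module;
                }
            }

            // 3. If that fails, fall back to an expired cache entry, if we found one.
            if (expired != null) {
                sourceOfModule.put(requested, expiredSource);
                return expired.module;
            }
            return null;
        }

        File downloadArtifact(String requested, String artifactId) {
            // Delegate directly to the resolver where the module was found - no chain walk.
            Resolver source = sourceOfModule.get(requested);
            return source == null ? null : source.downloadArtifact(artifactId);
        }
    }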

Also, we apply the same expiry time for all dynamic revisions. This includes 
snapshot revisions, changing = true, anything that matches a changing pattern, 
version ranges, dynamic revisions (1.2+, etc) and statuses 
(latest.integration). When we resolve a dynamic dependency descriptor, we 
persist the module that we ended up resolving to, and we use that value until 
the expiry is reached.
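
To make that concrete, here's roughly what I mean by "dynamic", and how the 
persisted result would be reused. The pattern matching is simplified (the real 
check would also consult the configured changing patterns), and the names are 
again invented:

    import java.util.HashMap;
    import java.util.Map;

    class DynamicRevisionSketch {
        // Roughly what counts as "dynamic" - all of these get the same time-to-live.
        static boolean isDynamic(String revision, boolean changingFlag) {
            return changingFlag                                    // changing = true
                || revision.endsWith("-SNAPSHOT")                  // snapshot revisions
                || revision.endsWith("+")                          // 1.2+, etc
                || revision.startsWith("latest.")                  // latest.integration, etc
                || revision.startsWith("[") || revision.startsWith("(");  // version ranges
        }

        static class PersistedResult {
            String resolvedRevision;   // the concrete revision we ended up resolving to
            long resolvedAt;
        }

        private final Map<String, PersistedResult> persisted = new HashMap<String, PersistedResult>();
        private final long timeToLiveMillis;

        DynamicRevisionSketch(long timeToLiveMillis) {
            this.timeToLiveMillis = timeToLiveMillis;
        }

        String resolve(String requestedRevision, boolean changingFlag) {
            if (!isDynamic(requestedRevision, changingFlag)) {
                return requestedRevision;
            }
            PersistedResult previous = persisted.get(requestedRevision);
            if (previous != null
                    && System.currentTimeMillis() - previous.resolvedAt < timeToLiveMillis) {
                return previous.resolvedRevision;   // reuse until the expiry is reached
            }
            PersistedResult fresh = new PersistedResult();
            fresh.resolvedRevision = resolveAgainstRepositories(requestedRevision);
            fresh.resolvedAt = System.currentTimeMillis();
            persisted.put(requestedRevision, fresh);
            return fresh.resolvedRevision;
        }

        private String resolveAgainstRepositories(String requestedRevision) {
            // Placeholder for the actual resolution described above.
            return requestedRevision;
        }
    }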

Some implications of this:

* We're making a performance-accuracy trade-off here. Which means we'll 
probably need some way to tweak the behaviour. Not sure exactly how this might 
look, yet. I'd start with a simple time-to-live property on each configuration, 
and let the use cases drive anything beyond that.

* For dynamic revisions, stopping on the first resolver means we may miss a 
newer revision that happens to be in a later repository. An alternate approach 
might be to use all resolvers for dynamic revisions, but only when there is no 
unexpired value in the cache. We could do this search in parallel, and just 
pick the latest out of those we have found at the end of some timeout. Perhaps 
we could do the search in parallel for all revisions, dynamic or not.

* We fetch artifacts only from the same repository as their module was found in 
(but this repository can have multiple patterns/base urls/etc). I think this is 
a good thing, from an accuracy/repeatability point of view.

* The fact that all dynamic revisions have the same time-to-live is a change in 
behaviour, but a good one, I think.

* Offline mode becomes cheap to implement. We just skip asking the resolvers. 
Plus, it makes offline mode less important, because we make more effort to use 
whatever is cached.

* Caching moves out of the resolvers and becomes a decoration that we apply 
consistently. This means less effort to implement a resolver in the future.
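
On that last point, the kind of decoration I have in mind is something like the 
following. It leaves out expiry, persistence and offline handling, and the names 
are invented, but it shows why a new resolver implementation wouldn't need any 
caching logic of its own:

    import java.util.HashMap;
    import java.util.Map;

    class DecorationSketch {
        static class ModuleDescriptor {}

        interface Resolver {
            ModuleDescriptor getDependency(String requested);
        }

        static class CachingResolver implements Resolver {
            private final Resolver delegate;   // a plain resolver, e.g. HTTP or file system
            private final Map<String, ModuleDescriptor> cache =
                    new HashMap<String, ModuleDescriptor>();

            CachingResolver(Resolver delegate) {
                this.delegate = delegate;
            }

            public ModuleDescriptor getDependency(String requested) {
                ModuleDescriptor cached = cache.get(requested);
                if (cached != null) {
                    return cached;   // never touch the delegate on a fresh hit
                }
                ModuleDescriptor found = delegate.getDependency(requested);
                if (found != null) {
                    cache.put(requested, found);
                }
                return found;
            }
        }
    }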

Thoughts? I want to get on to this as soon as milestone 5 is out.


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com
