On 26/09/2011, at 8:22 PM, Luke Daley wrote:

> 
> On 26/09/2011, at 6:34 AM, Adam Murdoch wrote:
> 
>> When resolving a dependency descriptor:
>> * We look for a matching dependency in the cache for each resolver. We don't 
>> invoke the resolvers at this stage. If a matching cached value is found that 
>> has not expired, we use the cached value, and stop looking.
>> * Otherwise, we attempt to resolve the dependency using each resolver in 
>> turn. We stop on the first resolver that can resolve the dependency.
>> * If this fails, and an expired entry was found in the cache, use that 
>> instead.
>> * We remember which resolver found the module, for downloading the artifacts 
>> later.
>> 
>> When resolving an artifact, we delegate directly to the resolver where the 
>> artifact's module was found.
>> 
>> Also, we apply the same expiry time for all dynamic revisions. This includes 
>> snapshot revisions, changing = true, anything that matches a changing 
>> pattern, version ranges, version prefixes (1.2+, etc.) and statuses 
>> (latest.integration). When we resolve a dynamic dependency descriptor, we 
>> persist the module that we ended up resolving to, and we use that value 
>> until the expiry is reached.
>> 
>> Some implications of this:
>> 
>> * We're making a performance-accuracy trade-off here, which means we'll 
>> probably need some way to tweak the behaviour. Not sure exactly how this 
>> might look, yet. I'd start with a simple time-to-live property on each 
>> configuration, and let the use cases drive anything beyond that.
>> 
>> * For dynamic revisions, stopping on the first resolver means we may miss a 
>> newer revision that happens to be in a later repository. An alternate 
>> approach might be to use all resolvers for dynamic revisions, but only when 
>> there is no unexpired value in the cache. We could do this search in 
>> parallel, and just pick the latest out of those we have found at the end of 
>> some timeout. Perhaps we could do the search in parallel for all revisions, 
>> dynamic or not.
> 
> If that timeout has expired, we should try all resolvers, I think.
> 
> If parallel is achievable in our timeframes, then I can't see a reason not to.
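
To make the lookup order quoted above concrete, here's a rough sketch of the 
per-dependency flow (the cache and resolver types are made up for illustration, 
this is not Gradle's actual code):

// Sketch only: cache, resolver and descriptor types are hypothetical.
// The numbered steps follow the quoted description.
def resolve(dep, resolvers, cache) {
    // 1. Prefer an unexpired cached result, without invoking any resolver.
    for (resolver in resolvers) {
        def cached = cache.lookup(resolver, dep)
        if (cached && !cached.expired) {
            return cached.module
        }
    }
    // 2. Otherwise ask each resolver in turn; stop at the first that succeeds.
    for (resolver in resolvers) {
        def module = resolver.resolve(dep)
        if (module) {
            // Remember which resolver found the module, so we can go straight
            // back to it when downloading the artifacts.
            cache.store(resolver, dep, module)
            return module
        }
    }
    // 3. Fall back to an expired cache entry, if any resolver had one.
    for (resolver in resolvers) {
        def cached = cache.lookup(resolver, dep)
        if (cached) {
            return cached.module
        }
    }
    throw new RuntimeException("Cannot resolve $dep")
}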

Ivy uses a nice sprinkling of static state, which means it's not going to be 
easy to do parallel resolution. There are a couple of options, however.

One option is to offer parallel resolution for the resolver implementations 
that we provide, as we can make sure they don't use any static state. That is, 
for the file- and HTTP-backed Ivy and Maven repositories that you define via 
repositories.maven(), ivy(), mavenCentral() and mavenLocal(), plus whatever we 
add in the future, we could potentially do parallel resolution.
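
For example, with resolvers of our own that we know to be thread-safe, the 
parallel search for a dynamic revision might look something like this sketch 
(the resolver interface and the revision comparison are assumptions, not real 
APIs):

import java.util.concurrent.Callable
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Sketch only: assumes each resolver is thread-safe and that revisions
// compare naturally, neither of which holds for Ivy's resolvers today.
def searchInParallel(dep, resolvers, timeoutSeconds) {
    def executor = Executors.newFixedThreadPool(resolvers.size())
    try {
        def futures = resolvers.collect { resolver ->
            executor.submit({ resolver.resolve(dep) } as Callable)
        }
        executor.shutdown()
        executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)
        // Pick the latest revision out of whatever completed in time
        // (ignoring error handling for brevity; null if nothing was found).
        def found = futures.findAll { it.done }.collect { it.get() }.findAll { it }
        return found.max { it.revision }
    } finally {
        executor.shutdownNow()
    }
}

If the timeout expires before anything has completed, we could then fall back 
to trying all resolvers serially, as Luke suggests above.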

This is potentially a lot of work, and it helps only those builds that use 
multiple remote repositories. So, the other option is a more general solution, 
which helps all builds:

We could start doing dependency resolution in parallel with executing tasks, so 
that, for example, while the unit tests for gradle core are executing, we can 
be resolving and downloading the compile classpath for gradle launcher. We'd 
still do dependency resolution in a single thread; it would just be a separate 
thread from the one executing the tasks.

There are a few ways to approach this. I think of it as making the DAG 
finer-grained. Currently, each node in the DAG is really made up of a few 
separate pieces of work: resolving any external artifacts that make up the 
input files, checking whether the outputs of the task are up-to-date with 
respect to its inputs, and finally executing the actions of the task. I think 
we should bust the task nodes up into separate nodes, each with its own 
dependencies. Here's an example:

task testCompile(type: JavaCompile) {
    classpath = configurations.testCompile + sourceSets.main.output
}

execute testCompile -> build the inputs of testCompile
build the inputs of testCompile -> build configurations.testCompile, build sourceSets.main.output
build configurations.testCompile -> the jar tasks from any project dependencies in configurations.testCompile
build sourceSets.main.output -> the compileJava task

A node would be available for execution as soon as all its dependencies have 
been executed. One thread would execute available task nodes, the other would 
"execute" available file collection nodes, that is, resolve external 
dependencies. So, in our example above, once the jars for our project 
dependencies have been built, we can execute the compileJava task and resolve 
configurations.testCompile in parallel.
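
Roughly, the scheduling could look like this sketch (the node type and worker 
loop are illustrative only, not a proposed API):

// Sketch only: a node is ready once all of its dependencies have executed.
// One worker thread would run over the task nodes, another over the file
// collection nodes; the volatile flag lets them see each other's progress.
class Node {
    String name
    List<Node> dependencies = []
    Closure action
    volatile boolean executed = false

    boolean isReady() { !executed && dependencies.every { it.executed } }

    void execute() {
        action.call()
        executed = true
    }
}

def runWorker(nodes) {
    while (nodes.any { !it.executed }) {
        def next = nodes.find { it.ready }
        if (next) {
            next.execute()
        } else {
            Thread.sleep(50)  // a dependency in the other worker's set is still running
        }
    }
}

Each worker would own a disjoint set of nodes, so nothing executes twice, and a 
node whose dependencies live in the other worker's set simply waits for the 
executed flag to flip.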

What is interesting about this approach is that it is a nice step towards 
parallel task execution. There are a few technical hurdles we need to tackle 
before we can go fully parallel for tasks (there will be others, of course):
* Dependency resolution is not thread safe
* Incremental build does not work across multiple processes
* Our progress reporting does not understand multiple things happening at the 
same time
* Same for profiling
* We need to be able to re-order task execution, but we don't know which tasks 
we can safely shuffle around

By splitting out dependency resolution and up-to-date checking into separate 
DAG nodes that are executed by a separate thread, we can defer solving the 
first two issues, but still do things in parallel. Doing things in parallel 
will force us to start tackling the last three issues. Later, we can add 
additional task execution threads, or additional worker processes, each with a 
dependency resolution thread and a task execution thread.

> 
> Also, we could save some time if we can specify that some repositories will 
> never have newer snapshot versions. There is no point checking Maven Central 
> for a newer copy of a given version of anything if it is already cached. 
> However, this isn't likely to offer much of a saving on small-to-medium 
> projects.

This is a good point. Some repositories have constraints on what is published 
there, e.g. don't go looking for snapshots at all in Maven Central.
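
Say, as a hypothetical DSL sketch (no such property exists today):

repositories {
    mavenCentral {
        // Hypothetical: this repository only ever contains releases, so
        // never re-check it for snapshots or other changing modules.
        releasesOnly = true
    }
}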

> 
> Have we considered mapping dependencies to specific repositories? I.e. 
> specifying that dependencies can only come from certain repositories would 
> certainly make resolution faster for large projects, but is less convenient. 
> Perhaps it could be optional, with the default behaviour being that a 
> dependency will be searched for in each repository. Another way to achieve 
> this might be to have include/exclude patterns on repositories that are 
> checked before attempting to search a repository for a particular artifact.

This is certainly an option.
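
For instance, as a hypothetical DSL sketch (these patterns don't exist, and the 
URL is made up):

repositories {
    ivy {
        url 'http://repo.mycompany.com/internal'  // hypothetical repository
        // Hypothetical: only ever search here for our own modules.
        include 'com.mycompany.*'
    }
    mavenCentral {
        // Hypothetical: never search here for internal modules.
        exclude 'com.mycompany.*'
    }
}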


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com
