On 28/09/2011, at 7:40 PM, Hans Dockter wrote:

> 
> 
> On Wed, Sep 28, 2011 at 7:03 AM, Adam Murdoch <[email protected]> 
> wrote:
> 
> On 26/09/2011, at 8:22 PM, Luke Daley wrote:
> 
>> 
>> On 26/09/2011, at 6:34 AM, Adam Murdoch wrote:
>> 
>>> When resolving a dependency descriptor:
>>> * We look for a matching dependency in the cache for each resolver. We 
>>> don't invoke the resolvers at this stage. If a matching cached value is 
>>> found that has not expired, we use the cached value, and stop looking.
>>> * Otherwise, we attempt to resolve the dependency using each resolver in 
>>> turn. We stop on the first resolver that can resolve the dependency.
>>> * If this fails, and an expired entry was found in the cache, use that 
>>> instead.
>>> * We remember which resolver found the module, for downloading the 
>>> artifacts later.
>>> 
>>> When resolving an artifact, we delegate directly to the resolver where the 
>>> artifact's module was found.
>>> 
>>> Also, we apply the same expiry time for all dynamic revisions. This 
>>> includes snapshot revisions, changing = true, anything that matches a 
>>> changing pattern, version ranges, dynamic revisions (1.2+, etc) and 
>>> statuses (latest.integration). When we resolve a dynamic dependency 
>>> descriptor, we persist the module that we ended up resolving to, and we use 
>>> that value until the expiry is reached.
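>>> 
>>> To make the lookup order above concrete, here is a rough sketch (hypothetical
>>> cache and resolver APIs, not our actual types):
>>> 
>>> def resolveDescriptor(descriptor, resolvers, cache, ttlMillis) {
>>>     def now = System.currentTimeMillis()
>>>     // 1. Prefer an unexpired cached value; don't touch the resolvers at all.
>>>     for (resolver in resolvers) {
>>>         def entry = cache.get(resolver, descriptor)
>>>         if (entry != null && now - entry.timestamp < ttlMillis) {
>>>             return entry.module
>>>         }
>>>     }
>>>     // 2. Otherwise ask each resolver in turn, stopping at the first one that
>>>     //    can resolve the dependency, and remember where the module came from
>>>     //    so that artifact downloads go straight back to that resolver.
>>>     for (resolver in resolvers) {
>>>         def module = resolver.resolve(descriptor)
>>>         if (module != null) {
>>>             cache.put(resolver, descriptor, module, now)
>>>             cache.rememberOrigin(descriptor, resolver)
>>>             return module
>>>         }
>>>     }
>>>     // 3. If nothing could be resolved, fall back to an expired cache entry.
>>>     def expired = resolvers.findResult { cache.get(it, descriptor)?.module }
>>>     if (expired != null) {
>>>         return expired
>>>     }
>>>     throw new RuntimeException("Cannot resolve $descriptor")
>>> }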
>>> 
>>> Some implications of this:
>>> 
>>> * We're making a performance-accuracy trade-off here, which means we'll 
>>> probably need some way to tweak the behaviour. Not sure exactly how this 
>>> might look yet. I'd start with a simple time-to-live property on each 
>>> configuration (there's a sketch below these bullets), and let the use cases 
>>> drive anything beyond that.
>>> 
>>> * For dynamic revisions, stopping on the first resolver means we may miss a 
>>> newer revision that happens to be in a later repository. An alternate 
>>> approach might be to use all resolvers for dynamic revisions, but only when 
>>> there is no unexpired value in the cache. We could do this search in 
>>> parallel, and just pick the latest out of those we have found at the end of 
>>> some timeout. Perhaps we could do the search in parallel for all revisions, 
>>> dynamic or not.
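>>> 
>>> As a very rough idea of what that time-to-live property might look like
>>> (purely hypothetical DSL, just to illustrate):
>>> 
>>> configurations {
>>>     compile {
>>>         // how long a resolved dynamic revision is trusted before we go back
>>>         // to the repositories and check for something newer
>>>         dynamicRevisionCacheTimeout = 24 * 60 * 60 * 1000 // millis
>>>     }
>>> }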
>> 
>> If that timeout has expired, we should try all resolvers, I think.
>> 
>> If parallel is achievable in our timeframes then I can't see a reason not to.
> 
> Ivy uses a nice sprinkling of static state, which means it's not going to be 
> easy to do parallel resolution. There are a couple of options, however.
> 
> One option is to offer parallel resolution for those resolver implementations 
> that we provide, as we can make sure they don't use any static state. That 
> is, for file and http ivy and maven repositories that you define via 
> repositories.maven(), ivy(), mavenCentral() and mavenLocal(), plus whatever 
> we add in the future, we could potentially do parallel resolution.
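> 
> To make that concrete, here is a sketch of what a parallel query of several 
> resolvers for a dynamic revision might look like (hypothetical resolver API, 
> naive revision comparison, and assuming the resolvers are thread-safe):
> 
> import java.util.concurrent.*
> 
> def resolveDynamic(descriptor, resolvers, timeoutMillis) {
>     def executor = Executors.newFixedThreadPool(resolvers.size())
>     try {
>         // Ask every resolver at once...
>         def futures = resolvers.collect { resolver ->
>             executor.submit({ resolver.resolve(descriptor) } as Callable)
>         }
>         // ...then keep whatever has answered by the deadline, and pick the
>         // latest revision out of those.
>         def deadline = System.currentTimeMillis() + timeoutMillis
>         def found = []
>         for (future in futures) {
>             try {
>                 def remaining = Math.max(0L, deadline - System.currentTimeMillis())
>                 def module = future.get(remaining, TimeUnit.MILLISECONDS)
>                 if (module != null) { found << module }
>             } catch (TimeoutException ignored) {
>                 // This resolver was too slow for this build; skip it.
>             }
>         }
>         return found.max { it.revision } // real version ordering is more subtle
>     } finally {
>         executor.shutdownNow()
>     }
> }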
> 
> This is potentially a lot of work, which helps only those builds that need 
> to use multiple remote repositories. So, the other option is a more general 
> solution, which helps all builds:
> 
> We could start doing dependency resolution parallel to executing tasks, so 
> that, for example, while the unit tests for gradle core are executing, we can 
> be resolving and downloading the compile classpath for gradle launcher. We'd 
> still do dependency resolution in a single thread. It would just be a 
> separate thread to that executing the tasks.
> 
> There are a few ways to approach this. I think about this as making the DAG 
> finer grained. Currently, each node in the DAG is really made up of a few 
> separate pieces of work: resolving any external artifacts that make up the 
> input files, checking whether the outputs of the task are up-to-date wrt its 
> inputs, and finally executing the actions of the task. I think we should bust 
> the task nodes up into separate nodes, each with their own dependencies. 
> Here's an example:
> 
> task testCompile(type: CompileJava) {
>     classpath = configurations.testCompile + sourceSets.main.output
> }
> 
> execute testCompile -> build the inputs of testCompile
> build the inputs of testCompile -> build configurations.testCompile, build 
> sourceSets.main.output
> build configurations.testCompile -> the jar tasks from any project 
> dependencies in configurations.testCompile
> build sourceSets.main.output -> the compileJava task
> 
> A node would be available for execution as soon as all its dependencies have 
> been executed. One thread would execute available task nodes, the other would 
> "execute" available file collection nodes, that is, resolve external 
> dependencies. So, in our example above, once the jars for our project 
> dependencies have been built, we can execute the compileJava task and resolve 
> configurations.testCompile in parallel.
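> 
> Very roughly, the two workers could look something like this (hand-wavy 
> sketch, not real Gradle types; a real implementation would block on work 
> becoming available rather than poll):
> 
> class Node {
>     String name
>     boolean isTask // true = execute a task, false = resolve a file collection
>     List<Node> dependencies = []
>     volatile boolean executed
>     boolean ready() { !executed && dependencies.every { it.executed } }
> }
> 
> // One worker per kind of node: one thread executes tasks, the other resolves
> // external dependencies, so the two kinds of work overlap.
> def worker(List<Node> graph, boolean taskWorker) {
>     Thread.start {
>         while (graph.any { !it.executed }) {
>             def node = graph.find { it.isTask == taskWorker && it.ready() }
>             if (node == null) { Thread.sleep(50); continue }
>             println "${taskWorker ? 'executing' : 'resolving'} $node.name"
>             node.executed = true
>         }
>     }
> }
> 
> // given a graph built from the example above:
> def taskThread = worker(graph, true)
> def resolveThread = worker(graph, false)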
> 
> What is interesting about this approach is that it is a nice step towards 
> parallel task execution. There are a few technical hurdles we need to tackle 
> before we can go fully parallel for tasks (there will be others, of course):
> * Dependency resolution is not thread safe
> * Incremental build does not work across multiple processes
> * Our progress reporting does not understand multiple things happening at the 
> same time
> * Same for profiling
> * We need to be able to re-order task execution, but we don't know which 
> tasks we can safely shuffle around
> 
> By splitting out dependency resolution and up-to-date checking into separate 
> DAG nodes that are executed by a separate thread, we can defer solving the 
> first 2 issues, but still do things in parallel. Doing things in parallel 
> will force us to start tackling the last 3 issues. Later, we can add 
> additional task execution threads, or additional worker processes, each with 
> a dependency resolution and task execution thread.
> 
>> 
>> Also, we could save some time if we could specify that some repositories will 
>> never have newer snapshot versions. There is no point checking Maven Central 
>> for a newer copy of a given version of anything if that version is already 
>> cached. However, this isn't likely to offer much of a saving on small to 
>> medium projects.
> 
> This is a good point. Some repositories have constraints on what is published 
> there, e.g. there is no point looking for snapshots at all in Maven Central.
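> 
> Something along these lines might express that (hypothetical flag, just to
> illustrate):
> 
> repositories {
>     mavenCentral() // releases only; never re-checked for newer snapshots
>     maven {
>         url 'http://repo.mycompany.com/snapshots'
>         // hypothetical: only repositories flagged like this are checked for
>         // newer copies of snapshot / changing modules
>         containsChangingModules = true
>     }
> }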
> 
>> 
>> Have we considered mapping dependencies to specific repositories? That is, 
>> specifying that a dependency can only come from certain repositories. This 
>> would certainly make resolution faster for large projects, but is less 
>> convenient. Perhaps it could be optional, with the default behaviour being 
>> that a dependency will be searched for in each repository. Another way to 
>> achieve this might be to have include/exclude patterns on repositories that 
>> are checked before attempting to search a repository for a particular 
>> artifact.
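>> 
>> For example, something like this might do it (hypothetical DSL, just to
>> illustrate the idea):
>> 
>> repositories {
>>     maven {
>>         url 'http://legacy.mycompany.com/repo'
>>         // only ever search this repository for the legacy modules
>>         includeModules 'com.legacy:*'
>>     }
>>     mavenCentral {
>>         // never search Maven Central for our own modules
>>         excludeModules 'com.mycompany:*'
>>     }
>> }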
> 
> This is certainly an option.
> 
> We will eventually enable this feature in a post-1.0 release. But the driver 
> for it would not be performance improvement; it would rather be explicitness 
> and exactness. For example, if there is just one dependency you retrieve from 
> a special repo, it would be nice to be explicit about that and not create the 
> impression that this repository is another general-purpose repo. Or you have 
> some legacy repo that has a lot of crap that conflicts with your other repo, 
> but you still need it for some stuff. You might want to isolate the crap.
> 
> Performance-wise we do want to improve, and there is some stuff we should do, 
> but I wouldn't go crazy here. This is mostly painful right now because there 
> are unnecessary network lookups for dynamic revisions. Once that is fixed, we 
> are talking about the usually few builds where dynamic revisions have expired 
> in the cache and need to be retrieved/rechecked. Plus you can always use a 
> repository manager to make this very efficient: repository managers allow you 
> to create any number of virtual repositories that are an arbitrary aggregation 
> of multiple physical ones. That way you need one and only one repo from a 
> Gradle perspective.

Exactly right. I'd rather add ways to limit dependencies to a given set of 
repositories if it were for correctness reasons, rather than performance 
reasons. I think we can get good enough performance out of the box, without 
people needing to tweak the search algorithm.


--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com
