On 28/09/2011, at 7:40 PM, Hans Dockter wrote:

> On Wed, Sep 28, 2011 at 7:03 AM, Adam Murdoch <[email protected]> wrote:
>
> On 26/09/2011, at 8:22 PM, Luke Daley wrote:
>
>> On 26/09/2011, at 6:34 AM, Adam Murdoch wrote:
>>
>>> When resolving a dependency descriptor:
>>> * We look for a matching dependency in the cache for each resolver. We don't invoke the resolvers at this stage. If a matching cached value is found that has not expired, we use the cached value, and stop looking.
>>> * Otherwise, we attempt to resolve the dependency using each resolver in turn. We stop on the first resolver that can resolve the dependency.
>>> * If this fails, and an expired entry was found in the cache, we use that instead.
>>> * We remember which resolver found the module, for downloading the artifacts later.
>>>
>>> When resolving an artifact, we delegate directly to the resolver where the artifact's module was found.
>>>
>>> Also, we apply the same expiry time to all dynamic revisions. This includes snapshot revisions, changing = true, anything that matches a changing pattern, version ranges, dynamic revisions (1.2+, etc.) and statuses (latest.integration). When we resolve a dynamic dependency descriptor, we persist the module that we ended up resolving to, and we use that value until the expiry is reached.
>>>
>>> Some implications of this:
>>>
>>> * We're making a performance-accuracy trade-off here, which means we'll probably need some way to tweak the behaviour. I'm not sure exactly how this might look yet. I'd start with a simple time-to-live property on each configuration, and let the use cases drive anything beyond that.
>>>
>>> * For dynamic revisions, stopping on the first resolver means we may miss a newer revision that happens to be in a later repository. An alternative approach might be to use all resolvers for dynamic revisions, but only when there is no unexpired value in the cache. We could do this search in parallel, and just pick the latest out of those we have found at the end of some timeout. Perhaps we could do the search in parallel for all revisions, dynamic or not.
>>
>> If that timeout has expired, we should try all resolvers, I think.
>>
>> If parallel is achievable in our timeframes then I can't see a reason not to.
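As an aside, the lookup order described above could be sketched roughly like this. The names are purely illustrative, not actual Gradle or Ivy types:

    // Rough sketch only - descriptor, resolvers and cache stand in for hypothetical collaborators.
    def resolve(descriptor, List resolvers, cache) {
        // 1. Look in the cache for each resolver first; don't invoke the resolvers yet.
        for (resolver in resolvers) {
            def cached = cache.find(resolver, descriptor)
            if (cached && !cached.expired) {
                return cached.module
            }
        }
        // 2. Otherwise, try each resolver in turn and stop at the first one that can resolve it.
        for (resolver in resolvers) {
            def module = resolver.resolve(descriptor)
            if (module) {
                // Remember which resolver found the module, so artifacts are later fetched from the same place.
                cache.store(resolver, descriptor, module)
                return module
            }
        }
        // 3. If every resolver failed, fall back to an expired cache entry if one exists.
        def expired = resolvers.findResult { cache.find(it, descriptor) }
        if (expired) {
            return expired.module
        }
        throw new RuntimeException("Cannot resolve $descriptor")
    }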
> Ivy uses a nice sprinkling of static state, which means it's not going to be easy to do parallel resolution. There are a couple of options, however.
>
> One option is to offer parallel resolution for those resolver implementations that we provide, as we can make sure they don't use any static state. That is, for file and http Ivy and Maven repositories that you define via repositories.maven(), ivy(), mavenCentral() and mavenLocal(), plus whatever we add in the future, we could potentially do parallel resolution.
>
> This is potentially a lot of work, and it helps only those builds that need to use multiple remote repositories. So, the other option is a more general solution, which helps all builds:
>
> We could start doing dependency resolution in parallel with executing tasks, so that, for example, while the unit tests for gradle core are executing, we can be resolving and downloading the compile classpath for gradle launcher. We'd still do dependency resolution in a single thread. It would just be a separate thread to the one executing the tasks.
>
> There are a few ways to approach this. I think about this as making the DAG finer grained. Currently, each node in the DAG is really made up of a few separate pieces of work: resolving any external artifacts that make up the input files, checking whether the outputs of the task are up-to-date wrt its inputs, and finally executing the actions of the task. I think we should bust the task nodes up into separate nodes, each with their own dependencies. Here's an example:
>
>     task testCompile(type: CompileJava) {
>         classpath = configurations.testCompile + sourceSets.main.output
>     }
>
>     execute testCompile -> build the inputs of testCompile
>     build the inputs of testCompile -> build configurations.testCompile, build sourceSets.main.output
>     build configurations.testCompile -> the jar tasks from any project dependencies in configurations.testCompile
>     build sourceSets.main.output -> the compileJava task
>
> A node would be available for execution as soon as all its dependencies have been executed. One thread would execute available task nodes, the other would "execute" available file collection nodes, that is, resolve external dependencies. So, in our example above, once the jars for our project dependencies have been built, we can execute the compileJava task and resolve configurations.testCompile in parallel.
>
> What is interesting about this approach is that it is a nice step towards parallel task execution. There are a few technical hurdles we need to tackle before we can go fully parallel for tasks (there will be others, of course):
> * Dependency resolution is not thread safe.
> * Incremental build does not work across multiple processes.
> * Our progress reporting does not understand multiple things happening at the same time.
> * Same for profiling.
> * We need to be able to re-order task execution, but we don't know which tasks we can safely shuffle around.
>
> By splitting out dependency resolution and up-to-date checking into separate DAG nodes that are executed by a separate thread, we can defer solving the first 2 issues, but still do things in parallel. Doing things in parallel will force us to start tackling the last 3 issues. Later, we can add additional task execution threads, or additional worker processes, each with a dependency resolution and task execution thread.
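To make the node-splitting idea a little more concrete, a rough sketch of the finer-grained graph might look like the following. The types are invented for illustration and are not actual Gradle internals:

    // Hypothetical node types, just to illustrate the finer-grained graph.
    abstract class Node {
        final List<Node> dependencies = []
        abstract void execute()
        boolean isReady(Collection<Node> completed) { dependencies.every { completed.contains(it) } }
    }

    class TaskNode extends Node {
        String path
        void execute() { println "executing task $path" }          // run the task's actions
    }

    class FileCollectionNode extends Node {
        String description
        void execute() { println "resolving $description" }        // resolve external dependencies
    }

    // For the testCompile example, the graph would contain roughly:
    //   execute testCompile              depends on  build the inputs of testCompile
    //   build the inputs of testCompile  depends on  build configurations.testCompile,
    //                                                build sourceSets.main.output
    //   build configurations.testCompile depends on  the jar TaskNodes of any project dependencies
    //   build sourceSets.main.output     depends on  the compileJava TaskNode
    //
    // One thread picks up ready TaskNodes, a second thread picks up ready FileCollectionNodes,
    // so compileJava can execute while configurations.testCompile is being resolved.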
>> Also, we could save some time if we can specify that some repositories will never have newer snapshot versions. There is no point checking Maven Central for a newer version of the same version number of anything if it is already cached. However, this isn't likely to offer much of a saving on small to medium projects.
>
> This is a good point. Some repositories have constraints on what is published there, e.g. don't go looking for snapshots at all in Maven Central.
>
>> Have we considered mapping dependencies to specific repositories? That is, specifying that dependencies can only come from certain repositories would certainly make resolution faster for large projects, but is less convenient. Perhaps it could be optional, with the default behaviour being that a dependency will be searched for in each repository. Another way to achieve this might be to have include/exclude patterns on repositories that are checked before attempting to search a repository for a particular artifact.
>
> This is certainly an option.
>
> We will eventually enable this feature in a post-1.0 release, but the scenario for this would not be performance improvement. It would rather be explicitness and exactness. For example, if there is just one dependency you retrieve from a special repo, it would be nice to be explicit about that and not create the impression that this repository is another general repo. Or you have some legacy repo that has a lot of crap that conflicts with your other repo, but you still need it for some stuff. You might want to isolate the crap.
>
> Performance-wise we want to improve, and there is some stuff we should do, but I wouldn't go crazy here. I mean, this is mostly painful right now because there are unnecessary network lookups for dynamic revisions. Once this is fixed, we are talking about the usually few builds where dynamic revisions have timed out in the cache and need to be retrieved/rechecked. Plus, you can always use a repository manager to make this very efficient. Repository managers allow you to create any number of virtual repositories that are an arbitrary aggregation of multiple physical ones. That way you need one and only one repo from a Gradle perspective.
Exactly right. I'd rather add ways to limit dependencies to a given set of repositories if it were for correctness reasons, rather than performance reasons. I think we can get good enough performance out of the box, without people needing to tweak the search algorithm.

--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com
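PS: purely as a sketch of how such a correctness-driven restriction might read in a build script - none of this is an existing Gradle API, and the repository URL and coordinates are made up - something along these lines:

    repositories {
        mavenCentral()
        maven {
            url 'http://repo.example.com/legacy'      // example URL
            // Hypothetical: only ever consult this repository for the legacy group, and nothing else.
            include group: 'com.example.legacy'
        }
    }

    dependencies {
        compile 'com.example.legacy:old-thing:1.2'    // would be searched for only in the legacy repo
        compile 'org.slf4j:slf4j-api:1.6.1'           // searched for in the remaining repositories
    }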
