Adam Murdoch wrote:


Steve Appling wrote:
I am interested in ways to short circuit task execution for the purpose of optimization. I would love to see some of this in 0.7 and would be glad to contribute.

Here are some ideas:
1) Add an "onlyIf" method to Task that is given a closure. The closure would be executed before the first action of the task and would cancel execution of the task (with an appropriate lifecycle message) if it returned false. The closure's delegate would be an optimization container with helper methods that give more convenient access to change detection (among other things). Then you could do:
  mytask.onlyIf {
    timestampChanged 'src/main/mysrc'
    // or contentsChanged 'src/main/mysrc'
  }
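
To make the proposed semantics concrete, here is a minimal sketch in plain Java rather than build script syntax. SketchTask and its methods are hypothetical stand-ins made up for illustration, not the Gradle Task API; the only behaviour it tries to show is that the predicate runs before the first action and a false result skips every action.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for a task; NOT the real Gradle Task API.
class SketchTask {
    private final String name;
    private final List<Runnable> actions = new ArrayList<>();
    private BooleanSupplier onlyIf = () -> true; // with no predicate set, the task always runs
    private boolean skipped;

    SketchTask(String name) { this.name = name; }

    SketchTask doLast(Runnable action) { actions.add(action); return this; }

    SketchTask onlyIf(BooleanSupplier predicate) { this.onlyIf = predicate; return this; }

    // Evaluates the predicate before the first action; a false result skips all actions.
    void execute() {
        if (!onlyIf.getAsBoolean()) {
            skipped = true;
            System.out.println(":" + name + " SKIPPED"); // stand-in for the lifecycle message
            return;
        }
        skipped = false;
        for (Runnable action : actions) {
            action.run();
        }
    }

    boolean isSkipped() { return skipped; }
}
```

A helper such as timestampChanged or contentsChanged would simply be the body of the predicate passed to onlyIf.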


I think this is a good idea.

2) Running a clean should probably remove the change detection state information for a project (or at least the clean task should be able to be configured to do this conveniently).


I think the change detection mechanism should figure out that the output artifacts don't exist any more instead.

One thing that clean should arguably get rid of is the internal repository in $rootDir/.gradle. I wonder if it should also clean the buildSrc project?

3) I would like some general way for tasks to indicate whether they did anything. Perhaps task.getDidWork(). BTW, I figured out how to do this for Gradle's use of ant.javac and can now tell if it really compiled anything.


When you say 'it really compiled anything' do you mean you can tell whether the task decided to invoke javac or not?

Ant's javac scans the source and class files itself to see if any source files are newer than the corresponding class files. If so, it then calls Java's javac with this list of outdated files. After executing the Gradle task, I can determine which files Ant actually passed to Java's javac. For several types of tasks (compile, groovycompile, copy, directory, zip, jar, tar), the task is already doing its own optimization by comparing source timestamps to some target during execution. It is possible to execute the task without it having any side effects. Since most of them have the information about what they actually did, it seems better (and faster) to use this information instead of scanning source / output a second time externally to see what changed.


I think it would be better if Gradle could figure out whether a task did anything, rather than require the task writer to do anything.
I would like this, but I'm not sure how to accomplish it in the general case. Tasks may have input/output other than just a set of files (like network operations, web services calls, deploy over webdav). Even tasks like copy may do the work in a way that makes it hard to see what happened after the fact. I know that we have several tasks whose output is put into the same directory as the output from other tasks. It would not be sufficient to just scan the output directories after each execution, since they would also include the results from other tasks. If you never allow parallel execution, then you could scan the output directories both before and after a task's execution, but this seems expensive. If the task already knows what it did, why not make use of that information?

For custom tasks (instances of DefaultTask), it seems simpler for a build writer to set some state to indicate if they did anything than to specify the set of files to check. If this check is best done by comparing files, then we should provide easy ways to call into the change detection code to set this state.

I think we could assume that if a task executes any task action, it has done work.
I don't think this is true. As I discussed above, there are many tasks (like compile) that execute their task action, but decide during execution to not cause any side effects.

If a task wants to do any short-circuiting, it would need to use an onlyIf() predicate. In addition, if we provided an easy way for a task to declare its output artifacts, then Gradle could automatically apply change detection to these output artifacts in order to decide whether the task did any work.

So, instead of adding a Task.didWork property, perhaps we should merge this concept with the existing Task.executed property into a single read-only Task.state property with an enum with values something like: created, executed, or skipped.


I think you should be able to distinguish executed and did something from executed and didn't do anything.
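
One way to picture a single read-only state that still keeps this distinction is sketched below in Java. The enum name and its values are assumptions made up for illustration (the thread only suggests "created, executed, or skipped"); this is not an existing Gradle type.

```java
// Hypothetical state model; names are illustrative, not an existing Gradle API.
enum SketchTaskState {
    CREATED,            // task exists but has not been run yet
    SKIPPED,            // the onlyIf check returned false, so no action ran
    EXECUTED_NO_WORK,   // actions ran but decided there was nothing to do
    EXECUTED_DID_WORK;  // actions ran and produced side effects

    // Derives the state after a run from whether the onlyIf check passed
    // and whether the task's actions reported doing any work.
    static SketchTaskState afterRun(boolean onlyIfPassed, boolean didWork) {
        if (!onlyIfPassed) {
            return SKIPPED;
        }
        return didWork ? EXECUTED_DID_WORK : EXECUTED_NO_WORK;
    }

    boolean didWork() {
        return this == EXECUTED_DID_WORK;
    }
}
```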

4) I would like to be able to specify that, in a chain of dependent tasks, a task only executes if Task.didWork is true for all of the tasks it depends on. Note that this is not always desired, so you need to be able to turn this on and off. I'm not sure of the best way to configure this. If we use the onlyIf method suggested above, it might take another closure to check this that would be returned from a "needed" method. This would look like:
  myTask.onlyIf(needed())

This probably should be the default for tests, but perhaps not for all Tasks.


I'm not sure about this approach.
After trying to implement some of this, I no longer like all of this approach either. I don't think there is anything appropriate to do "for a chain of dependent tasks". I do still like the general idea of onlyIf { isNeeded() }. I think that isNeeded may be a good place to contain any mechanism for Gradle to automatically determine whether artifacts the task depends on have changed or tasks it depends on did work.


The tests should run if either the test classes or the classes under test have changed since last time we successfully ran the tests. Arguably a change to the test runtime classpath should also cause the tests to run. In other words, the tests should be run only if the input artifacts have changed since last time we ran the tests. Checking whether all the dependencies of the test task have executed or not is only an approximation of this, and not a general solution. For example, if I assemble my classes under test using, say, 2 independent Compile tasks, then the test task should run if either task has done something. Or, I may assemble my classes using some other build tool, so that there's no task which we can use to check whether or not the classes have changed.

To me, the key to task optimisation is to base it on the input and output artifacts of a task. If we make it easy to declare both the input and output artifacts of a task, we make the model much richer, and from this we get a lot of goodness.

For example, if we know what the input artifacts for a task are, Gradle can apply change detection to those input artifacts on the task's behalf. If we also know which tasks produce those artifacts, then Gradle can optimise the change detection. Gradle could, for example, when it knows which task produces a given artifact, simply use the fact that the producer task executed an action or not to decide whether the input artifacts have changed, and only fall back to hashing or timestamps or a Java 7 file watcher or whatever when it doesn't know how the artifact is produced. Similarly, it could use the fact that a Jar was downloaded by the dependency management system to decide whether the input artifacts have changed.
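
As a rough illustration of the hashing fallback mentioned above, here is a small Java sketch. ContentChangeDetector and contentsChanged are hypothetical names, and this is not the change detection code discussed in this thread; it only records a SHA-256 digest per file in memory and reports whether the content differs from the last recorded digest. A real implementation would persist the digests between builds.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not Gradle's change detection.
class ContentChangeDetector {
    // Last recorded digest per file (in memory only in this sketch).
    private final Map<Path, String> recorded = new HashMap<>();

    // Returns true if the file has not been seen before or its content differs
    // from the last recorded digest. Always records the current digest.
    boolean contentsChanged(Path file) {
        String current = digest(file);
        String previous = recorded.put(file, current);
        return !current.equals(previous);
    }

    private static String digest(Path file) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] hash = sha.digest(Files.readAllBytes(file));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Knowing the producer task would let the build skip this hashing entirely, which is the optimisation described above.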

Adding input and output artifacts to the model also lets us use this information to build the DAG, and to be smart about skipping tasks. For example, if the test task were to declare that it uses the test classes directory and the test runtime configuration as input artifacts, then Gradle would be able to automatically add the tasks that produce these (if any) to the task dependencies of the test task.
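
A minimal Java sketch of that inference, assuming artifacts are identified by plain strings and each task declares a set of inputs and outputs. DependencyInference, the task names, and the artifact names are all hypothetical; nothing here is an existing Gradle API.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of deriving task dependencies from declared artifacts.
class DependencyInference {
    // outputs: task name -> artifacts it produces
    // inputs:  task name -> artifacts it consumes
    // Returns, for each consuming task, the tasks that produce at least one of its inputs.
    static Map<String, Set<String>> infer(Map<String, Set<String>> outputs,
                                          Map<String, Set<String>> inputs) {
        Map<String, Set<String>> dependencies = new TreeMap<>();
        for (Map.Entry<String, Set<String>> consumer : inputs.entrySet()) {
            Set<String> producers = new TreeSet<>();
            for (Map.Entry<String, Set<String>> producer : outputs.entrySet()) {
                boolean sameTask = producer.getKey().equals(consumer.getKey());
                if (!sameTask && !Collections.disjoint(producer.getValue(), consumer.getValue())) {
                    producers.add(producer.getKey());
                }
            }
            // Inputs with no known producer (e.g. built by another tool) add no dependency.
            dependencies.put(consumer.getKey(), producers);
        }
        return dependencies;
    }
}
```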

Knowing which tasks produce and consume a given artifact also allows us to extract concurrency constraints from the model. If 2 tasks both contribute to the production of the same artifact (classes dir, say), they should not run concurrently. Or if 2 tasks both consume the same artifact, they should not run concurrently. And obviously a producer and consumer task for a given artifact should not run concurrently.
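
The three rules as stated above can be written down directly. This Java sketch uses a hypothetical ConcurrencyRules helper with artifacts as plain strings; it encodes the rules exactly as described in this message and makes no claim about how Gradle would actually schedule tasks.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical illustration of the concurrency rules described in this message.
class ConcurrencyRules {
    // Two tasks may run concurrently only if none of the three rules applies.
    static boolean mayRunConcurrently(Set<String> inputsA, Set<String> outputsA,
                                      Set<String> inputsB, Set<String> outputsB) {
        // Rule 1: both tasks contribute to the same artifact.
        boolean sharedOutput = !Collections.disjoint(outputsA, outputsB);
        // Rule 2: both tasks consume the same artifact.
        boolean sharedInput = !Collections.disjoint(inputsA, inputsB);
        // Rule 3: one task produces an artifact that the other consumes.
        boolean producerConsumer = !Collections.disjoint(outputsA, inputsB)
                || !Collections.disjoint(outputsB, inputsA);
        return !(sharedOutput || sharedInput || producerConsumer);
    }
}
```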

Extending this further, if we know the input and output artifacts of a task, or subgraph of tasks, we can distribute the work to remote machines.

I think it might be a good approach to first add support for the onlyIf clause and some helpers to allow manual use of optimization, and then investigate techniques to allow Gradle to be smarter about this and do more automatically. If Gradle just adds optimization rules to tasks in the built-in plugins and doesn't provide automated optimization for custom tasks, you will still get a lot of benefit.

I generally like the idea of a richer model that has information about what each task consumes and produces, but I'm not clear exactly how this would be specified. I don't want to require the build writer to duplicate information about what the task inputs / outputs are. I would love to see some examples of how this would work for general tasks.

Javac is already checking to see if the source files are out of date relative to the classes, so I don't think that the javac task needs to use the new change detection. This would, however, let you stop other tasks in the chain (like test) if nothing needed to be compiled. (Unrelated: I would also like to see an option on compile to use Ant's depend task. I think the current dependencyTracking option doesn't work with the modern compiler.)

Other types of tasks could make good use of Tom's change detection.

5) We probably want a command line option to be able to disable all of these optimizations. Sometimes you really want to force a build with no optimizations (without running clean).


In the race for speed, Gradle will probably never catch Ant in a clean build (at least while you are delegating most of the expensive stuff to Ant).

I wonder. The richer our model, the more scope we have to optimise without the build script author or task author having to do anything special. We can automatically extract parallelism. We can inline and batch tasks. We can distribute bits of the build. We can reuse work that other machines have already done.


Adam


--
Steve Appling
Automated Logic Research Team
