On 07/10/2010, at 7:06 AM, Steve Appling wrote:

> As part of trying to optimize my build, I have been trying to figure out how
> much time was spent by the up-to-date checks of inputs/outputs for each task.
> I just realized that tasks which are actually executed, but which set the
> didWork flag to false are marked as UP-TO-DATE. I can't distinguish between
> tasks where execution was skipped and tasks which are executed, but didn't
> end up doing anything.

If you really need to, you can add a TaskExecutionListener and a TaskActionListener to the Gradle instance. You'll then have the information you need to distinguish between the two cases.
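For example, a minimal sketch in a build script, assuming the current behaviour described above (a task whose actions ran but which set didWork to false is still reported as skipped). The ranActions set and the printed messages are just for illustration:

    import org.gradle.api.Task
    import org.gradle.api.execution.TaskActionListener
    import org.gradle.api.execution.TaskExecutionListener
    import org.gradle.api.tasks.TaskState

    // Record the paths of tasks whose actions were actually invoked.
    def ranActions = [] as Set

    gradle.addListener(new TaskActionListener() {
        void beforeActions(Task task) { ranActions << task.path }
        void afterActions(Task task) { }
    })

    gradle.addListener(new TaskExecutionListener() {
        void beforeExecute(Task task) { }
        void afterExecute(Task task, TaskState state) {
            if (state.skipped) {
                // A 'skipped' task either never ran its actions, or ran
                // them without doing any work.
                def detail = ranActions.contains(task.path) ?
                        'executed, but did no work' : 'never executed'
                println "$task.path was $state.skipMessage ($detail)"
            }
        }
    })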
> Should tasks that do no work be marked as skipped and up-to-date? They
> aren't really skipped, should they really be marked that way? If they are
> marked as skipped, can we use another skip message to distinguish the two
> cases?

To the user of the build, there's no real difference. What would the different message be? Why would a user care?

Having said that, it is potentially an important distinction for a build author, and our API should reflect the difference. Perhaps we change the semantics of TaskState so that:

* task.state.skipped is true only if none of the actions of the task were executed.
* task.state.skipMessage is renamed to something like statusMessage, to reflect the fact that a task can have a final status regardless of whether it was skipped or not.

> AbstractTask.setDidWork was never promoted to the Task interface. I think
> perhaps it should be. I'm setting this now from some DefaultTasks, but
> hadn't realized that this is not really public.

I guess it should be. The pattern build authors use today looks something like the sketch below.
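A sketch of that pattern, with a hypothetical deleteTempFiles task. It works because DefaultTask extends AbstractTask, where setDidWork() is declared:

    task deleteTempFiles {
        doLast { t ->
            // Hypothetical action: remove a scratch directory if present.
            File tmpDir = new File(t.project.buildDir, 'tmp')
            boolean removed = tmpDir.exists() && tmpDir.deleteDir()
            // setDidWork() lives on AbstractTask, not the Task interface,
            // so this relies on the concrete task type.
            t.didWork = removed
        }
    }

If nothing was deleted, the task reports that it did no work and shows up as UP-TO-DATE.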
> If I can distinguish better between input/output checking and tasks that did
> no work, I might have some better numbers, but it looks to me like a good
> portion of my build is spent doing the input/output up-to-date checks. This
> is particularly expensive for tasks whose inputs are configurations that
> contain large numbers of jars. Hashing the contents of large files like this
> is expensive. I think it would be nice to be able to choose another strategy
> for these checks (like comparing modification times and file size).

We actually do persist the modification time and file size, as well as the hash. If a file's modification time and size have not changed, we assume that the hash has not changed either, so a file is only rehashed after it has been touched. Per input file, the check is roughly the sketch below.
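A minimal sketch of that strategy (not Gradle's actual implementation; the FileSnapshot class and the choice of MD5 are just for illustration):

    import java.security.MessageDigest

    // What we persist for each input file between builds (invented type).
    class FileSnapshot {
        long lastModified
        long size
        byte[] hash
    }

    byte[] hashOf(File file) {
        // Reads the whole file into memory; fine for a sketch.
        MessageDigest.getInstance('MD5').digest(file.bytes)
    }

    boolean unchanged(File file, FileSnapshot previous) {
        // Fast path: same timestamp and size, so assume the same hash
        // without re-reading the file.
        if (file.lastModified() == previous.lastModified
                && file.length() == previous.size) {
            return true
        }
        // Slow path: the file was touched, so compare content. A file
        // rewritten with identical bytes still counts as unchanged, even
        // though its timestamp moved.
        Arrays.equals(hashOf(file), previous.hash)
    }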
I would argue that using modification time alone, without the hash, actually has a good chance of making the average build time longer. The reason for this is that you lose the short-circuiting that hashes give you. Without hashes, you depend on every task in the build being well behaved and not touching files it doesn't need to change. If you have a single task in a dependency chain which always touches a file, you end up executing every subsequent task in that chain. If hashes are used, then only that task executes, and everything after it is short-circuited. In a good-sized build, I think it is almost a certainty that you will have a task that does not behave well wrt incremental build (e.g. almost every Ant task in existence).

There are some other interesting things we can do with hashes that we can't do with timestamps. For example:

* Say we were to introduce the concept of a classpath. Then, when we do an up-to-date check for a classpath provided to a task, we can use the hashes and ignore the file names. For example, if my build uses System.currentTimeMillis() in the version number, my jars end up with different names each time I run a build, but often the content is the same. Using a hash means we can detect this and skip the task. We could not reliably assume this if we were to use file size + timestamp alone.
* Say we were to introduce distributed testing. We can use the hashes to determine what has changed since the last time the tests were run on the remote machine, and just send the diffs. Same for the results coming back.
* As above for remote deployment. Or distributed builds.
* Say we wanted to extend the idea of incremental build, so that developer machines could share artifacts built from the same source. Why should I build 50 project dependencies on my machine when someone else has already built them? Gradle should just grab them off the other machine for me. Hashes would be needed to make this work.

You get the idea. I think hashes == good and timestamps == pointless.

There are certainly things we can do to make the up-to-date detection faster. Some possibilities:

* In the daemon, we can use file system notifications to do the change detection in the background as files change.
* Better caching of file system scans. We scan the file system more often than we need to.
* For project dependencies (or maybe for any kind of artifact dependency), we could think about ignoring the intermediate files, such as .class files. We might just check whether the source files have changed, and that the jar file still exists and has the same content, and short-circuit the entire project.
* Stream the history for a task with a large number of inputs or outputs, and stop as soon as we find something out of date.

Also, incremental build is essentially about getting the average build time down. There might be better performance gains elsewhere:

* Making configuration faster. For example, we might come up with some way to do partial configuration, where only the build scripts for projects that are actually required are executed.
* Making dependency resolution faster. Again, we can do things like use the daemon to resolve in the background, so that the configurations are pre-resolved when the build executes. Plus there's lots of caching we could potentially do there.
* Making compilation faster.
* Parallel builds, distributed testing, distributed builds. I don't care if up-to-date checking is fast or slow if I can throw 15 machines at the build.

--
Adam Murdoch
Gradle Developer
http://www.gradle.org
CTO, Gradle Inc. - Gradle Training, Support, Consulting
http://www.gradle.biz