On 07/10/2010, at 7:06 AM, Steve Appling wrote:

> As part of trying to optimize my build, I have been trying to figure out how
> much time was spent by the up-to-date checks of inputs/outputs for each task.
> I just realized that tasks which are actually executed, but which set the
> didWork flag to false are marked as UP-TO-DATE. I can't distinguish between
> tasks where execution was skipped and tasks which are executed, but didn't
> end up doing anything.

If you really need to, you can add a TaskExecutionListener and a TaskActionListener to the Gradle instance. You'll then have the information you need to distinguish between the two cases.
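For example, a minimal sketch in a build script, assuming the current behaviour described above (a task whose actions ran but which set didWork to false is still reported as skipped). The ranActions set and the printed messages are just for illustration:

    import org.gradle.api.Task
    import org.gradle.api.execution.TaskActionListener
    import org.gradle.api.execution.TaskExecutionListener
    import org.gradle.api.tasks.TaskState

    // Record the paths of tasks whose actions were actually invoked.
    def ranActions = [] as Set

    gradle.addListener(new TaskActionListener() {
        void beforeActions(Task task) { ranActions << task.path }
        void afterActions(Task task) { }
    })

    gradle.addListener(new TaskExecutionListener() {
        void beforeExecute(Task task) { }
        void afterExecute(Task task, TaskState state) {
            if (state.skipped) {
                // A 'skipped' task either never ran its actions, or ran
                // them without doing any work.
                def detail = ranActions.contains(task.path) ?
                        'executed, but did no work' : 'never executed'
                println "$task.path was $state.skipMessage ($detail)"
            }
        }
    })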
> Should tasks that do no work be marked as skipped and up-to-date? They
> aren't really skipped, should they really be marked that way? If they are
> marked as skipped, can we use another skip message to distinguish the two
> cases?

To the user of the build, there's no real difference. What would the different message be? Why would a user care?

Having said that, it is potentially an important distinction for a build author, and our API should reflect the difference. Perhaps we change the semantics of TaskState so that:

* task.state.skipped is true only if none of the actions of the task were executed.
* task.state.skipMessage is renamed to something like statusMessage, to reflect the fact that a task can have a final status regardless of whether it was skipped or not.

> AbstractTask.setDidWork was never promoted to the Task interface. I think
> perhaps it should be. I'm setting this now from some DefaultTasks, but
> hadn't realized that this is not really public.

I guess it should be. The pattern build authors use today looks something like the sketch below.
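A sketch of that pattern, with a hypothetical deleteTempFiles task. It works because DefaultTask extends AbstractTask, where setDidWork() is declared:

    task deleteTempFiles {
        doLast { t ->
            // Hypothetical action: remove a scratch directory if present.
            File tmpDir = new File(t.project.buildDir, 'tmp')
            boolean removed = tmpDir.exists() && tmpDir.deleteDir()
            // setDidWork() lives on AbstractTask, not the Task interface,
            // so this relies on the concrete task type.
            t.didWork = removed
        }
    }

If nothing was deleted, the task reports that it did no work and shows up as UP-TO-DATE.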
> If I can distinguish better between input/output checking and tasks that did
> no work, I might have some better numbers, but it looks to me like a good
> portion of my build is spent doing the input/output up-to-date checks. This
> is particularly expensive for tasks whose inputs are configurations that
> contain large numbers of jars. Hashing the contents of large files like this
> is expensive. I think it would be nice to be able to choose another strategy
> for these checks (like comparing modification times and file size).

We actually do persist the modification time and file size, as well as the hash. If a file's modification time and size have not changed, we assume that the hash has not changed either, so a file is only rehashed after it has been touched. Per input file, the check is roughly the sketch below.
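A minimal sketch of that strategy (not Gradle's actual implementation; the FileSnapshot class and the choice of MD5 are just for illustration):

    import java.security.MessageDigest

    // What we persist for each input file between builds (invented type).
    class FileSnapshot {
        long lastModified
        long size
        byte[] hash
    }

    byte[] hashOf(File file) {
        // Reads the whole file into memory; fine for a sketch.
        MessageDigest.getInstance('MD5').digest(file.bytes)
    }

    boolean unchanged(File file, FileSnapshot previous) {
        // Fast path: same timestamp and size, so assume the same hash
        // without re-reading the file.
        if (file.lastModified() == previous.lastModified
                && file.length() == previous.size) {
            return true
        }
        // Slow path: the file was touched, so compare content. A file
        // rewritten with identical bytes still counts as unchanged, even
        // though its timestamp moved.
        Arrays.equals(hashOf(file), previous.hash)
    }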
I would argue that using modification time alone, without the hash, actually has a good chance of making the average build time longer. The reason for this is that you lose the short-circuiting that hashes give you. Without hashes, you depend on every task in the build being well behaved and not touching files it doesn't need to change. If you have a single task in a dependency chain which always touches a file, you end up executing every subsequent task in that chain. If hashes are used, then only that task executes, and everything after it is short-circuited. In a good-sized build, I think it is almost a certainty that you will have a task that does not behave well wrt incremental build (e.g. almost every Ant task in existence).

There are some other interesting things we can do with hashes that we can't do with timestamps. For example:

* Say we were to introduce the concept of a classpath. Then, when we do an up-to-date check for a classpath provided to a task, we can use the hashes and ignore the file names. For example, if my build uses System.currentTimeMillis() in the version number, my jars end up with different names each time I run a build, but often the content is the same. Using a hash means we can detect this and skip the task. We could not reliably assume this if we were to use file size + timestamp alone.
* Say we were to introduce distributed testing. We can use the hashes to determine what has changed since the last time the tests were run on the remote machine, and just send the diffs. Same for the results coming back.
* As above for remote deployment. Or distributed builds.
* Say we wanted to extend the idea of incremental build, so that developer machines could share artifacts built from the same source. Why should I build 50 project dependencies on my machine when someone else has already built them? Gradle should just grab them off the other machine for me. Hashes would be needed to make this work.

You get the idea. I think hashes == good and timestamps == pointless.

There are certainly things we can do to make the up-to-date detection faster. Some possibilities:

* In the daemon, we can use file system notifications to do the change detection in the background as files change.
* Better caching of file system scans. We scan the file system more often than we need to.
* For project dependencies (or maybe for any kind of artifact dependency), we could think about ignoring the intermediate files, such as .class files. We might just check whether the source files have changed, and that the jar file still exists and has the same content, and short-circuit the entire project.
* Stream the history for a task with a large number of inputs or outputs, and stop as soon as we find something out of date.

Also, incremental build is essentially about getting the average build time down. There might be better performance gains elsewhere:

* Making configuration faster. For example, we might come up with some way to do partial configuration, where only the build scripts for projects that are actually required are executed.
* Making dependency resolution faster. Again, we can do things like use the daemon to resolve in the background, so that the configurations are pre-resolved when the build executes. Plus there's lots of caching we could potentially do there.
* Making compilation faster.
* Parallel builds, distributed testing, distributed builds. I don't care if up-to-date checking is fast or slow if I can throw 15 machines at the build.

--
Adam Murdoch
Gradle Developer
http://www.gradle.org
CTO, Gradle Inc. - Gradle Training, Support, Consulting
http://www.gradle.biz