In this post, I will discuss some of the issues surrounding file-level
dependencies and support of incremental builds.  Abuild's support for
this is pretty good -- probably as good as any make-based build system
can realistically be.  It's not as good as what scons has, at least
based on my reading, though I have not personally used scons.  Gradle
should be able to be as good as scons and better than abuild or make here.

Abuild's dependency management is based on file modification times,
which is true of most traditional C/C++ build systems.  In an ideal
world, the decision of whether a build is up to date or not is made
based on something like a digest of everything that goes into the
build.  This would include contents (not modification times) of files,
commands used to build, any parts of the environment that affect the
build, etc.  Realistically, this may be too hard since it's hard to know
ALL the external factors that may actually affect the build, but at
least some factors beyond modification time should be there.  Either
way, you have to be able to tell which files contribute to the creation
of each
artifact or intermediate build product.  As I reread this, I realize
I've completely glossed over one detail: a dependency of A on B should
mean that A is automatically generated and needs to be regenerated if B
changes.  It doesn't make much sense to talk about A depending on B if A
is not automatically generated.  (It's okay if B is also automatically
generated.)  However, sometimes people are tempted to express
dependencies in terms of non-generated items and to use transitive
dependencies.  This may work in some cases and not in others.
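
To make the digest idea concrete, here is a minimal sketch (assuming a
sha256sum-style tool and hypothetical file names): hash the contents of
every input along with the command used to build, and consider the
product out of date whenever the combined digest differs from the one
recorded at the last build:

(cat a.c a.h b.h; echo "gcc -O2 -c a.c -o a.o") | sha256sum

Modification times never enter into it; only contents and commands do.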

Here are two important principles about figuring out what the
dependencies of a source file are.  I recommend following these two
principles unless there is a compelling reason not to.

 1.  Use the preprocessor to figure out what files go into compiling a
given source file.  Don't try to be smart about parsing the file
yourself to figure it out.  You will end up re-implementing the
preprocessor.  Remember that, in addition to any symbols you may define
in header files or on the command line (not a good idea, but people do
it all the time), the compiler may have many preprocessor symbols defined
internally or defined as part of the compiler's configuration.  The
actual set of files that are included can be conditional upon some macro
whose definition can be conditional upon some preprocessor symbol that's
not mentioned anywhere in your build.  You also have to know the include
path to resolve #includes since files with the same name may appear in
multiple places, making include path ordering significant.  You also
have to recursively expand all the dependencies for each object file. 
You can't take a shortcut and use transitive dependencies (see the
example after these two principles).  You
may think you have something like a.o -> a.c -> a.h -> b.h, and you may
be tempted to stop your analysis at a.h if you already know that a.h ->
b.h, but you can't do that because a.h may include b.h conditionally
upon something #defined in a.c or in something else a.c includes.  The
only reliable way to know what files are going into a build is to run
the preprocessor on every single source file and look at what it does. 
Luckily there is a standard way to do this, which I'll discuss below. 
Some compilers support precompiled headers, which only work if your
headers aren't doing goofy stuff like what I described.  Maybe if you're
lucky enough to have these, you can exploit them in some way, but
generally, you're trading safety for speed, which is probably a bad
trade.  Especially on current systems, the preprocessor isn't going to
dominate the build time.

 2.  If a file depends on something that does not exist, this is not an
error.  Just treat the file as out of date.  This is really important. 
Suppose a.c #includes b.h which may include either c.h or d.h based on
some preprocessor symbol.  Now suppose you ran a build once, and the
dependency was on c.h, so you (correctly) recorded somewhere that a.c
depends on c.h even though you never explicitly include it.  Then later,
c.h disappears but also b.h changes to no longer reference it.  When you
check to see whether a.o is up to date, you discover that a.o depends
on c.h, which doesn't exist.  No sweat.  This just means a.o is out of
date and needs to be recompiled from a.c.  When you do this, you'll get
a new list of dependencies.  Basically, if all of an item's dependencies
exist and are of the correct state (usually meaning that they are older
than the product, but ideally meaning that they are the same as they
were at the last build), then your item is up to date.  Otherwise, it is
not up to date.  In no case should inability to see a dependency be
treated as an error.  If the dependency really is still needed and is
not going to be automatically generated, the build will still fail, so
you're not sacrificing any safety by letting it go at this stage if it
does happen to be a sign of an actual error.
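
Here's a tiny illustration of the conditional include problem from
principle 1 (hypothetical files):

/* a.c */
#define USE_B 1
#include "a.h"

/* a.h */
#ifdef USE_B
#include "b.h"
#endif

Here a.h pulls in b.h only because a.c defined USE_B before including
it.  Another source file that includes a.h without defining USE_B does
not depend on b.h at all, so a cached a.h -> b.h edge would be wrong for
that file.  The only way to get this right is to preprocess each source
file and see what actually happens.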

The basic paradigm for generating file-based dependencies is to use the
preprocessor and to generate the dependency information at the time of
the build.  If you don't have pre-existing dependency information, then
the file should be considered out of date.  If you do, then the file is
up to date only if every recorded dependency exists and is unchanged. 
Some compilers, like gcc, can
save you a step.  If you invoke gcc with -MD, it will spit out
makefile-style dependencies which you can use directly.  If you also
give -MP, then it will also spit out an empty dependency list for each
dependency, which implements the "non-existent dependency = out of date,
not error" principle for make.  Since gradle doesn't use make, you
probably don't care about -MP.  The output file generated by gcc -MD -MP
contains dependencies that can be directly parsed by make, but gradle
could also suck those in and use them in its own way.  Make uses them by
verifying that each dependency listed is older than the object file, but
gradle could be smarter and could cache the checksum of each dependency
at the time of the last build and then recheck that before the next
build.  Since gradle will also know how the compiler was invoked last
time and how it is about to be invoked this time, gradle can do a much
better
dependency check.  If it does this, I strongly recommend that there be a
debugging flag that enables gradle to tell the end user EXACTLY why it
is rebuilding something.  (For example, rebuilding a.o because
dependency q.h does not exist, dependency r.h has changed, and the
compiler flags have changed.)  I can tell you that running a build you
think is supposed to be up to date and having it rebuild for unknown
reasons is a common pattern that is hard to debug.  Usually it is
because there is an unknown side effect of the build that keeps updating
some file that is listed as a dependency.  I've often seen this happen
with builds that include code generators that unconditionally regenerate
their outputs and include a timestamp.  Instead, such code generators
should only rewrite the output if its content has actually changed. 
This wouldn't
be gradle's fault, but it would be nice if gradle could help the
developer notice what's wrong.  Trying to get this with gnu make is
painful...make -d prints more information than most people can deal
with, and while what you need is there, it's hard to dig it out unless
you're experienced with gnu make's debugging output.
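
As a concrete example (hypothetical file names), compiling with

gcc -MD -MP -MF a.d -c a.c -o a.o

might leave something like this in a.d:

a.o: a.c a.h b.h
a.h:
b.h:

The first line is the dependency list; the empty rules for a.h and b.h
are what -MP adds so that make doesn't fail if one of those headers
later disappears.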

If you don't have a compiler that knows how to generate makefile-style
dependencies, then your next best bet is to look at the preprocessor
output.  Per the ANSI specification for the C preprocessor (and I think
you can ignore non-ANSI compilers these days...who out there is going to
use a pre-ANSI C compiler with gradle?), the preprocessor generates
output that contains lines that look like

# line-number filename flags

You can parse this output to get a list of every file that was read by
the preprocessor.  Those are your dependencies.
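
For example, gcc -E a.c might emit lines like

# 1 "a.c"
# 1 "a.h" 1
# 1 "b.h" 1

and a quick sketch of extracting the unique file names (assuming GNU
sed; a real implementation should also skip pseudo-files like
"<built-in>" and "<command-line>" that gcc reports) is

gcc -E a.c | sed -n 's/^# [0-9][0-9]* "\(.*\)".*/\1/p' | sort -u

Each unique name that comes out is a dependency.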

Abuild actually includes a little utility called gen_deps that parses
preprocessor output and generates makefile-style dependencies based on
the output.  Abuild's gen_deps does a few other things too, like
converting Windows paths to cygwin paths, because abuild itself, when
running on
Windows, is totally Windows native (and doesn't use cygwin), but it
invokes the cygwin version of make and therefore makefile dependencies
have to use cygwin paths.  (If there were going to be an abuild 2.0, it
would probably not use gnu make anymore but would invoke the compiler
directly.  Having abuild use gnu make was an expediency decision.)  But
the important thing is just parsing that output, which is pretty easy.

So abuild's gcc support passes -MD -MP -MF <name-of-dep-file> to gcc to
generate dependency information, and the gnu make side includes the
resulting dep files without failing when they don't exist.  Abuild's
msvc support, on the other hand, first runs the preprocessor and then
runs the compiler.  Note that most compilers, including all compilers
I've ever worked with, include some flag that can be passed to the
compiler to run the preprocessor.  So for gcc, I would never run cpp
manually but would run gcc -E instead.  With msvc, cl /E does the job.
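
One caveat if you parse this output yourself: the two compilers don't
write the line markers identically, so (if my memory of msvc's output
is right) a parser should accept both forms:

# 1 "a.h" 1
#line 1 "a.h"

The first is what gcc -E writes; the second is the #line directive form
that cl /E produces.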

The advantage of using something like gcc -MD -MP over using gcc -E and
parsing the output is just that you save yourself a little compile time
by not having to invoke the preprocessor twice.  You probably want
gradle to be able to do the same thing: use gcc's built-in support and
have some preprocess-based solution for msvc.  For other compilers,
obviously pick the built-in way if available, but falling back to the
preprocessor way should pretty much always work.

The other nice thing about the general-purpose gen_deps utility is that
it makes it easy for a plugin author to implement fine-grained
dependencies for other things.  If you have some other system that
supports includes, like LaTeX, for example, you could implement
something that parses whatever your source files are and just spits out
something of the form

# n filename

for each dependent file.  Then gen_deps could parse that and you'd have
your fine-grained dependencies without any additional work.  I've done
this in the past for documentation generation and even for my own custom
code generators.
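
For example, here is a rough sketch for LaTeX (assuming GNU sed and
plain \input{...} usage with no macro trickery):

sed -n 's/.*\\input{\([^}]*\)}.*/# 1 \1.tex/p' doc.tex

Feed that output to gen_deps and you get makefile-style dependencies
for your document.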

Here are a few other things to keep in mind when thinking about
fine-grained dependencies.  If you can include the command line that was
used to compile a file in the dependency calculation, do it.  This
eliminates the problem of someone overriding preprocessor symbols on the
command line resulting in an inconsistent or non-reproducible build. 
I've seen this often where someone wants to enable debugging on a
specific source file.  They might delete that file's .o and rerun make
with CFLAGS=-DDEBUG.  Sure, that one file will be compiled that way, but
maybe the debugging support doesn't work unless some other file is
also compiled with -DDEBUG.  What I always tell people to do instead is
to have a header file where they #define stuff like that.  That way
modification-time-based dependencies will work.  But if you can
actually include a digest of a canonical form of the compilation
command as part of the dependency check, then you are not prone to this
problem.
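
As a concrete version of the header file suggestion, here is a
hypothetical build_config.h:

/* build_config.h: #define build-time switches here instead of passing
   them on the command line so that dependency checking sees changes */
#define DEBUG 1

Touching build_config.h makes every object that includes it out of
date, which is exactly what you want.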

Use checksums instead of modification times when possible.  Abuild
doesn't do this as I've mentioned several times already.  Usually this
is not a huge deal, but there are some really hard-to-catch problems
that can result, particularly in very long builds.  Suppose you have the
following sequence of events:

T1 start compiling A, which depends on B, C, and D
T2 compile B
T3 compile C
T4 modify B
T5 compile D
T6 link A

At the end of this process, A is newer than B, C, and D, but B has
actually changed since it went into A.  A system like make (or abuild
which uses make) would consider A up to date in this case, when in fact,
A is not up to date.  If the gap between T1 and T6 is short or if you
can set things up so that the source tree is quiescent during the entire
build (both of which are good ideas), then you're okay, but I've seen
systems where a build tree can be compiled against a backing build of
dependencies that might get republished during the course of the build,
which can cause exactly the scenario I mentioned above.  This is bad
enough on its own, but it is made even worse if an incremental build
wouldn't catch it.  Remember that C/C++ builds generally take much
longer than Java builds.  At my last job, when we got the build down to
90 minutes, people were celebrating.  I've seen builds that take a whole
day.  (That was a particularly badly designed build though.)

Be careful about code generators that generate multiple things,
particularly when the products are a mixture of source files and header
files.  The most usual case of this is with flex and bison (or lex and
yacc).  In this case, it's important, if at all possible, to have the build
tool (or whatever plugin is providing support for the code generator)
know the exact rules about what dependencies are implied by a particular
run of the code generator.  I'll illustrate with an example.

Let's say you have A, which depends on a.yy.o and b.o, where a.yy.o is
compiled from a.yy.c and b.o is compiled from b.c.  Suppose that a.yy.c
is automatically generated from a.yy and that the generation of a.yy.c
also generates a.yy.h, and finally that b.c includes a.yy.h.  This is a
totally normal thing.  Let's say your dependencies are expressed as
follows (using makefile syntax):

A: a.yy.o b.o
a.yy.o: a.yy.c
b.o: b.c a.yy.h
a.yy.c: a.yy

If you run this build sequentially, the first thing that would happen
would be that the need for a.yy.o would cause generation of a.yy.c and,
as a side effect, a.yy.h.  Then a.yy.c and b.c would both compile. 
Suppose you tried to compile b.c first.  This time, b.o depends on a.yy.h,
which doesn't exist.  This isn't an error though since a non-existent
dependency is never an error for purposes of figuring out what to build,
but when you go to build b.c, the build fails because nothing knows how
to make a.yy.h.  In this case, there needs to be an explicit dependency
that a.yy.h depends on a.yy as well so that the system generates a.yy.c
and a.yy.h before compiling b.c.  Also, it would be nice if the system
knew that a.yy.h and a.yy.c were generated together so that it wouldn't
try to run the command twice, once to generate a.yy.h and once to
generate a.yy.c.  Relying solely on automatic dependency generation in
this case might miss this situation, and you might end up with a case
where the build works if a.yy gets built before b.c and not otherwise. 
That may mean that your build works serially but not in parallel, or
that the build works sometimes but other times it fails until you try
rebuilding, and then it succeeds the second time.  (I've seen all these
cases.)  This is just a normal dependency specification problem that, in
the above case, would be solved by just noting that a.yy.h also depends
on a.yy and ideally that a.yy.h and a.yy.c are generated together.  The
moral of the story is that you have to keep track of dependencies
implied by code generators by knowing what they generate.  The more you
leave it up to developers to explicitly specify dependencies, the more
likely you are to miss cases like this.  This is especially tricky when
you have code generators that generate multiple files.
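
In makefile syntax, the fix looks something like this (a sketch,
assuming bison generates the files; note that a plain multi-target rule
can still run the recipe more than once, e.g. in parallel builds, so
truly expressing "generated together" needs something like the
grouped-target syntax (&:) that newer versions of gnu make provide):

A: a.yy.o b.o
a.yy.o: a.yy.c
b.o: b.c a.yy.h
a.yy.c a.yy.h: a.yy
	bison -d -o a.yy.c a.yy

Now either build order works: compiling b.c first forces a.yy to be
processed because b.o needs a.yy.h, and the system knows how to make it.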

--Jay
