Proposal to resolve (again) the include dependency hell

Willy Tarreau Sat, 09 May 2020 03:49:08 -0700

Hi all,

I prefer to warn, this is long and only of interest for those who code,
so don't force yourself to digest this if not necessary :-)


Just like every few years we end up in situations where performing what
seems to be small changes just ends up taking a few days moving code around
just due to circular dependencies in the includes. And this time I'd like
to fix that before releasing 2.2, because du to the drop of legacy HTTP
after 2.0 there's already little that trivially backports to 2.0 and
earlier, and I'd rather have a single "backport horizon" between 2.0 and
2.2 than multiple ones caused by the includes mess.

What's the problem ?

The problem is always the same: some include files define some structures,
some other files need them so they have to include the first ones. At some
point a third file needs to know something from the second file and includes
it. Then the first file gets new stuff and is extended to include the third
one and bam!

This used to plague the development process before 1.3 and was addressed in
1.3 by splitting files into "types" and "proto". The idea was that include
files would have one part that only declares types and another one which
only declares prototypes.

This split appeared wise by then (and it resisted for 14 years so it was
probably not that bad). The idea stemmed from seeing that by then most of
the build issues were caused by function prototypes referencing types from
other files.

After this split, a new problem appeared, with foreign types appearing
as pointers in structures or as arguments to function pointers. A typical
example of this is:

  connection.h:
    struct mux_ops {
        int  (*init)(struct connection *conn, struct proxy *prx, struct session 
*sess, struct buffer *input);
        int  (*wake)(struct connection *conn);
        size_t (*rcv_buf)(struct conn_stream *cs, struct buffer *buf, size_t 
count, int flags);
        ...
    };

    struct connection {
        ...
        const struct mux_ops  *mux;
    };

  proxy.h:
    #include <types/checks.h>
    struct email_alertq {
        ...
        struct check check;
        ...
    };

    struct proxy {
        ...
    };

  check.h:
    #include <types/connection.h>
    struct check {
        ...
        struct proxy *proxy;
        ...
    };

This is easily solved by something we initially tried to avoid by considering
it dirty (though its not), which consists in simply placing some forward
declarations in the files which require certain pointers, since for this
purpose we just need to know that such a type exists but we don't care
about its internals. That's how you find this near the top of connection.h:

    struct connection;
    struct conn_stream;
    struct cs_info;
    struct buffer;
    struct proxy;
    struct server;
    struct session;
    struct pipe;

This, by the way, is another very good reason for not using typedefs!

Then another problem started to appear with static inline functions from
each include files each relying on a function from the other file. This
was rare a long time ago when we used to directly access each struct's
field, but as the code started to become more modular with some fields
optional in certain structures, we started to use a bit more accessors,
like si_conn() to retrieve a connection from a stream_int or si_oc() to
retrieve the output channel from a steam_int. That's how you see some
ugly exceptions like this one in channel.h where it was not possible to
call si_rx_room_rdy() which required to include stream_interface.h before,
which itself already includes channel.h:

    chn_prod(chn)->flags &= ~SI_FL_RXBLK_ROOM; // si_rx_room_rdy()

This is not overly present but still leaves a bitter taste when you're
coding and see that your code refuses to build due to a circular dependency
loop.

These ones actually can be addressed using a forward declaration again
for the inline function:

    static inline void si_rx_room_rdy(struct stream_interface *si);

And that's done. The problem is, there's nowhere to place it given that
it's already the proto file which is causing trouble. We could cheat and
place it in types but it would just be an exception to the rule.

A next problem is that some code was imported verbatim and must not be
changed so that it remains easy to updated it from upstream projects (e.g.
XXH, plock, SHA1), and some code is imported then locally adapted thus
cannot easily be rebased to upstream (e.g. ebtree), and some code was
written in order to be reusable elsewhere but we never know what should
be the extent of its own dependencies (e.g. ist, buffer, hpack). There's
nothing hard here, the rules just need to be clear.

And finally the last one, which by far is the better. Who said that history
perpetually reinvents itself ?

We made some of the new code so much modular that some of it is completely
self-contained and may be reused in other components like those in contrib/
or even for other projects, or fuzzers. We have buffer.h or the HPACK decoder
and encoder which are like this. As it is quite pleasant to work in such
situations, this encouraged us to pursue this with some auto-registering code
like the muxes which do not need to be known from outside. This caused major
trouble to implement traces because we couldn't rely on external stuff without
breaking this isolation. And one of the things we started to do while working
on this is that we've started to use single include files again. And guess
what ? unsurprizingly the initial problem surfaced again. For example I
needed to use the swrate_add() function from proto/freq_ctr.h inside the
pools code in common/memory.h, and couldn't because freq_ctr.h also includes
common/standard.h which includes common/chunks.h which in turn includes
common/memory.h!

So I've put a few days of thinking on this and I think I now have a fairly
better understanding of what's wrong and that I can propose a much better
approach to solve all of these issues now, without moving too much stuff.

What was wrong first was the split between types and function protoypes.
As we've seen, types themselves can embark a function prototype, and even
a structure may reference a pointer to a type from another file. So we do
not address anything this way. We *must* accept that there will always be
some forward declarations of foreign types just for the purpose of declaring
a pointer somewhere. That's easy and harmless. Plus it doesn't require any
generalization, just add the declaration before your struct if it doesn't
build, and that's all.

What needs to be separated however are the inline functions, because these
are the ones which access the internals of other structs and force the
include of the files that declare them. Slightly more than half of the
files in common/ and proto/ declare inline functions:

  $ grep -l 'static inline' include/common/* include/proto/* | wc -l
  69
  $ grep -l '^' include/common/* include/proto/* | wc -l
  116

This means that among them, roughly 70 really need to export some functions
into a separate file and 50 don't need to. As such we could get back to a
clean situation where almost anything related to a given subsystem could be
in its own file (what we're seeking in common in fact) and that if inlines
have to be added *and shared*, a new file is created for them.

I think that by thinking in terms of usage and not in terms of content
creation or maintenance, we can easily see how various includes are loaded
from any code part. Includes are supposed to be loaded (explicitly or
implicitly) in this order:

  1) the os-specific includes found in sys/ on the system
  2) the libc-specific includes in the default path
  3) includes from optional external libs provided by the system (dl, SSL etc)

     Note: depending on libc or operating systems, the frontier between
           #2 and #3 may be small, especially regarding low-level stuff
           like pthread or dl.

  4) includes from other components we import verbatim

     Note: having them right here is important because these imported
           components might eventually become installed by default so
           that we don't need to bring our own anymore.

  5) base internals (compat.h, compiler.h, defaults, threads)

     Note: these ones are needed by absolutely every other file. These are
           standard types definitions, default values, array sizes, macros,
           THREAD_LOCAL definitions, function or variable attributes etc.
           The ones may possibly rely on each other in an unspecified way,
           but cannot rely on any upper level ones. Maybe including only
           "base.h" or "haproxy.h" to get them all would be nice.

  6) includes that we're purposely exporting for ease of reuse

     Note: these ones must be able to build outside of the project, but may
           benefit from any of those above. For example a structure might
           be declared using the ALIGN attributes or some inline functions
           written using the likely()/unlikely() macros but such definitions
           should be obvious enough to port outside, or ideally locally done
           inside #ifndef clauses. It's important to note that these ones
           could also one day become standard and appear in #3 if they were
           relevant.

  7) most of the generic stuff that depends on internals above and that is
     not necessarily built (standard tools, debug, traces, most of
     common/*.h in fact).

  8) includes that we've forked to possibly depend on some of our stuff
     above. One example would be ebtree that may depend on our traces or
     debugging provided above.

  9) everything else.

In addition sometimes we may have to #define _GNU_SOURCE before system
includes, though at first glance it doesn't seem like it would help to have
this set into yet another include file.

It still seems very likely that compat.h, compiler.h and defaults.h could
make sense earlier. This could be a way to enclose some system-specific
includes in #ifndef determined from compat.h for example.

As can be seen here the ordering remains quite clear and self-explanatory for
the lower layers, so almost nothing needs to change, except maybe splitting a
file or two. But in the "everything else" case, we have the various type
definitions and all the internal subsystems. We can have two approaches in
order to solve them:

  - either we continue with 2 distinct directories like we have with
    "types" and "proto", but at least "proto" will have to be renamed
    because the prototypes will have to move to types and only the inline
    functions will appear in the second;

  - or we use a single directory and just add a suffix to the files that
    deserve being split.

While I used to prefer the former approach, I'm getting increasingly tempted
by the second one which seems to cause much less work over time. The reason is
that it allows to further split some files which would deserve to. One example
is hpack which comes in 4 files. Another one is the awful "standard.h" which
started as a set of standard tools to complement the libc and grew into an awful
monster that depends on everything and that everything depends on. And even in
an editor it's easier to immediately spot that there are multiple files from
a same prefix than when you have to peek into different directories.

So one first theorical approach could possibly look like this for a given .h
or .c file:

   #include <base/haproxy.h>    // covers compat, compiler, defaults

   // system files (some of them possibly already loaded by compat.h)
   #include <sys/stat.h>
   ...
   // libc files
   #include <string.h>
   ...
   // other libs' files
   #ifdef __ELF__
   #include <dlfcn.h>
   #endif
   #include <lualib.h>
   #include <openssl/x509.h>
   #include <pcre2.h>
   #include <slz.h>

   // other libs we imported
   #include <import/lru.h>
   #include <import/plock.h>
   #include <import/sha1.h>
   #include <import/xxhash.h>

   // common stuff (threads, std tools etc)
   #include <common/hathreads.h>
   #include <common/std-net.h>
   #include <common/std-random.h>
   #include <common/std-varint.h>

   // exported stuff
   #include <export/buffer.h>
   #include <export/ist.h>
   #include <export/hpack-tbl.h>

   // forked stuff
   #include <forked/eb32tree.h>

   // and the regular stuff
   #include <haproxy/stream.h>      // prototypes and everything
   #include <haproxy/stream-fct.h>  // for inlined functions

Obviously no file would have to include all that stuff by itself since some
of it is already included by the previous ones. It could also be argued that
the distinction between common and export is not necessary at all and that
the relevance of exporting some code only comes years after its inception.
Often, placing instructions in the file itself is sufficient for anyone to
remain careful, and till now this has worked pretty well. As such probably
that "export" wouldn't exist and that it would remain like now in "common".

Similarly, placing forked code into "forked" may be a pain because we
usually don't decide from the first day that something imported suddenly
gets forked. Also if our local changes get upstreamed, it becomes legacy
again. So most likely that leaving it in "import" with a prominent notice
that it was locally modified is sufficient. If it starts to rely on local
stuff, it will be able to directly include it anyway so that remains
transparent.

In the end I think that would leave us more or less with this:
  - creation of a new "base" directory to put the few files that must not
    depend on anything else and which are solely used to ease portability
    (compat, defaults, compiler as of today).

  - move of ebtree/* into import/

  - creation of haproxy/ to move everything related to our internal
    architecture there, with a second file when static functions are
    needed.

  - kill types/ and proto/

I'm interested in any opinion on the subject, especially from those who
touch this code all day, but also from those who touch is once in a while
and who experienced difficulties finding their way through this.

I'm probably going to spend a good part of my week-end trying to move this
around to explore various possibilities and come up with more food (or I'll
give up if it's too much of a mess).

Thanks,
Willy

Proposal to resolve (again) the include dependency hell

Reply via email to