Achieving such a change is often more of a social problem than a technical one.

From: "Tom Roche" <> Sent: Monday, April 23, 2012 1:06 AM

summary: background, motivation, and plan for a tree-structured development
process are presented. My question is (roughly): should the "code bucket"
to which code is committed at each level of the process be implemented by a
separate repository, a separate branch on a single repository, something
else, or does it matter?


Apologies for the length of this post, but it seems there's a lot to
explain. In reverse order, I have a question about a plan, for which I
present the motivation and some background: feel free to skip over parts,
but I suspect it "all ties together."


I recently began working with a group that had an embarrassingly
extended release. The short story is, we kept throwing what we thought was
good code over the wall to the beta testers, who kept throwing it back, for
5 months. The long story/etiology includes:

1 Our people are software engineers only by default. They're really
scientists who code, and who have learned a bit about software engineering
"just by doing." But their focus has been on what they do with their code,
not their tools or development process (until now), which can seem pretty
crude (at least, to a coder who's starting to learn the science, like me).

Been there. Been that engineer.

2 We have a very centralized dev process. There is one CVS repo to which
everyone commits. (Technically, there are several: since they don't know how
to branch or create read-only users, they just clone the filesystem
every time they want to freeze something. But for commits there is only one
repo.) Everyone commits to HEAD, for the same reasons most CVS users don't
branch. Theoretically everyone runs a big bucket o' tests before committing;
in practice, there's a small group (2 guys) who manage releases and
actually/reliably test.

3 We have a very long release cycle: several years, for which there are
apparently some legitimate reasons. But we don't do intermediate
integrations, or manage dependencies; ISTM, that's just slack, and means
that pre-release testing follows a painful pre-release integration of code
from our many contributors.

Related to this etiology are the following continuing constraints:

4 Resource: our funding is flat, and our group's headcount is actually
declining (retirees are not being replaced). We are supplied with
contractors who service our clusters (more below), but no other computing
support (other than desktop support for "productivity apps" like Lotus
Notes). We need more contributions from our community of users (which I
suspect many could/would give), but, for legal reasons (not related to
licensing--the code is open-source), it's hard for us to enable access to
code that has not been "fully reviewed." (More on excessive security
below.) These are longer-term problems :-(

5 Automated testing of large-scale scientific models seems inherently hard.
(If there's anyone out there working on this problem, please ping me
offline--I'd like to learn more.) There are ways to attack this in which I'm
definitely interested, but that's also a longer-term problem.

Some interesting papers:

The approach described in 09_Hook looks interesting for
gaining some level of _statistical_ confidence in the quality of the
software. Also see Kelly's ref 8 about just how wrong scientific software
can be.

6 We are not mobile developers. We run and test our code on a couple of
clusters which are behind some exceedingly strict firewalls--so strict that
few folks have the ability to VPN (aggravated by the resource constraint),
and it's painful for those that do. We can't ssh or https out of the
clusters, which complicates sharing of code (via, e.g., github) and data.
Hence folks work on code almost entirely from their desks (which are on LANs
that have cluster access) and not from home or on the road. This is also not
likely to change anytime soon.


My group intensively uses our tool for our scientific work (we majorly "eat
our own dogfood"), but we also have a significant external community of
users. The 5-month delay of an announced release was therefore rather
embarrassing, and we also realize that it wasted lots of time/effort. Now
that we're planning for the next release, I'm proposing some process
upgrades to address those problems. Some proposals are no-brainers, or at
least are off-topic for this post:

* CVS -> git: the following plan presupposes we do this. This is not quite a
no-brainer, since we'll hafta train folks how to use git, but I can't see any
disadvantages to migration that aren't outweighed by the advantages of git.

This may be harder to implement than simply installing the tool and training folks in its use. A whole set of working methods and assumptions will be challenged by such a change: the loss of multiple versions all available at the same time, folks deleting a directory and losing the repo with it, not being able to find their work, a careless "git reset --hard", etc.

You'll need to find the features that the users desire, or hate, and make your solution an easy win. E.g., in Matlab, code gets duplicated because you can't have two versions of a function (the filename must be the function name), so folks duplicate the codebase, with a consequent loss of control--solved with branches for function variants.
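
As a concrete illustration of that branches-for-variants point (all file and branch names below are hypothetical), a variant becomes a branch rather than a copied tree:

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/model"
cd "$tmp/model"
git config user.email coder@example.com
git config user.name "A Coder"
base=$(git symbolic-ref --short HEAD)   # master or main, depending on git version

# Baseline version of the function (Matlab forces filename == function name).
echo 'function y = solve_chem(x)' > solve_chem.m
git add solve_chem.m
git commit -q -m 'baseline solve_chem'

# Instead of copying the whole tree, keep the variant on its own branch:
git checkout -q -b fast-approx
echo '% faster, lower-accuracy variant' >> solve_chem.m
git commit -q -am 'fast approximation of solve_chem'

git checkout -q "$base"   # the baseline tree is untouched
git branch                # both variants coexist, under version control
```

Switching branches swaps the variant in and out of the single canonical filename, so nothing is duplicated and every variant stays under control.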

* dependency management (a bit more on dependencies below)

* shorter release cycle || intermediate integration builds using specified
dependencies (and that's a boolean 'or')

(If you've got reasons why not to do those, please post me separately, and
not on this thread/Subject.)


My final proposal is more complex. I'd appreciate comments on it,
particularly regarding an implementation detail discussed below. This
implementation detail reflects the similarities and differences between git
repositories (or remotes) and their branches. Since in git the difference to
the user between {pushing code to, pulling code from} any particular branch
on any particular repository can be made fairly transparent (am I missing
something?), I'll just use the term "code bucket" to refer to anything
from/to which one can pull/push.
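
A minimal, runnable sketch of why that distinction can be transparent: pulling from a colleague's entire repository is the same one-liner as pulling a branch (all paths and names below are made up):

```shell
set -e
tmp=$(mktemp -d)

# Bucket 1: a colleague's separate repository.
git init -q "$tmp/alice"
git -C "$tmp/alice" config user.email alice@example.com
git -C "$tmp/alice" config user.name Alice
echo 'alice code' > "$tmp/alice/a.f90"
git -C "$tmp/alice" add a.f90
git -C "$tmp/alice" commit -q -m 'alice: add a.f90'

# Bucket 2: my own repository, pulling from hers. Whether the argument to
# "git pull" names a remote, a path, or resolves to a branch elsewhere,
# the user-visible command is the same.
git init -q "$tmp/mine"
git -C "$tmp/mine" config user.email me@example.com
git -C "$tmp/mine" config user.name Me
abranch=$(git -C "$tmp/alice" symbolic-ref --short HEAD)
git -C "$tmp/mine" pull -q "$tmp/alice" "$abranch"

ls "$tmp/mine"   # alice's a.f90 arrived; the "bucket" happened to be a repo
```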

For better testing and evaluation, I'm proposing that we move from a
centralized process/repository to a tree structure. The release managers
(who have other jobs--they do this "on the side") are empirically overloaded,
Make sure your workflow helps both sides of the fence here--allow workgroups to mix, merge and rebase their code until it's good before passing it to the release manager (i.e. make sure the RMs don't have to do all the review work they did previously).

so ISTM we need better "division of labor," i.e., distribution
of test and integration effort. Furthermore, we already have workgroups
which discuss and prioritize big function chunks (e.g., chemistry,
meteorology, land cover), and project groups working on smaller ones (e.g.,
aerosol nucleation), "in between" the individual scientist/coder and the
top-level management/repository. (Note that everyone belongs to more than
one workgroup and project team: software is modular, but nature is not.)
I'm trying to leverage those groups to get the necessary integration/test
work done, and give the release managers "fewer throats to choke." The
proposal is, bottom up:

1 Each coder gets her/his own bucket, for her/his own code, on which s/he
tests as s/he will. The main difference between that and the status quo
(besides cvs -> git) is, s/he will be required to publicly declare (on our
group's wiki) what test(s) s/he runs.

2 Each project (i.e., one or a few function points we want to add or fix)
gets assigned to a project team (PT). Each PT

* has a declared lead, who is responsible for that project, and represents
the PT at workgroup meetings.

* must declare what test(s) it runs on its code.

* has its own separate code bucket. When a member coder wants to "commit,"
s/he requests pull from her/his PT lead, who pulls/merges/tests. The PT
evaluates the results; if satisfactory, the PT lead commits to that PT's
bucket.

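The steps in that last bullet can be sketched as follows; repo paths, file names, and the stand-in test suite are all hypothetical:

```shell
set -e
tmp=$(mktemp -d)

# The PT's bucket starts with the team's current code.
git init -q "$tmp/pt"
git -C "$tmp/pt" config user.email lead@example.com
git -C "$tmp/pt" config user.name "PT Lead"
echo 'module chem' > "$tmp/pt/chem.f90"
git -C "$tmp/pt" add chem.f90
git -C "$tmp/pt" commit -q -m 'team baseline'

# A coder works in her own bucket (a clone) and commits there.
git clone -q "$tmp/pt" "$tmp/coder"
git -C "$tmp/coder" config user.email coder@example.com
git -C "$tmp/coder" config user.name Coder
echo 'print *, "nucleation"' > "$tmp/coder/nucleation.f90"
git -C "$tmp/coder" add nucleation.f90
git -C "$tmp/coder" commit -q -m 'aerosol nucleation kernel'

# She "requests pull"; the lead pulls/merges, then runs the PT's declared
# tests before the merge is allowed to stay in the PT bucket.
branch=$(git -C "$tmp/coder" symbolic-ref --short HEAD)
git -C "$tmp/pt" pull -q "$tmp/coder" "$branch"
( cd "$tmp/pt" && test -f nucleation.f90 && echo "PT tests pass" )
```

If the declared tests fail, the lead can simply reset the PT bucket back to its pre-merge state; nothing the coder did is lost, since it still lives in her own bucket.
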
3 Each workgroup (WG) is like a super-PT: a WG integrates the code from
member PTs in the way that each PT integrates its team members. A WG

* has a declared lead, who is responsible for its set of functions, and
represents the WG when meeting with the release managers.

* must declare what test(s) it runs on its code.

* has its own separate code bucket. When a member PT wants to "commit," it
requests pull from its WG lead, who pulls/merges/tests. The WG evaluates the
results; if satisfactory, the WG lead commits.

4 The release managers (RMs) integrate the code from the workgroups. The RMs
collectively determine, for a given release or integration build (IB),

* dates

* what its dependencies will be (i.e., on what versions of (e.g.) libraries
and compilers that release or IB must run)

* what function goes in (the determination and arbitration of which seems to
consume lotsa work)

  The RMs also

* must declare what test(s) they run on the release or IB

* manage the top-level (separate) code bucket. When a WG wants to "commit,"
it requests pull from an RM, who pulls/merges/tests. The RMs evaluate the
result; if satisfactory, the RMs commit.


My general questions are, does the plan above seem

* feasible, given our constraints?

It's a culture change which, if you leverage it from the right point, could do very well, or could be crushed by the inertia of the current working methods.

* solvent: does it seem likely to solve the problems described above?
(notably, that the centralization of our process is overwhelming the folks
at the center)
Do they appreciate that they have a totally broken process based on false premises [*1*], or do they still think that folks just need to do it properly (bigger stick!) and it'll all come right? Unless they realise that the fundamentals are no longer fit for purpose, they will continue to return to them.

My specific question regards the implementation of the "code bucket" at each
of the levels above: should it be implemented by

* a separate repository

I'd expect that with such specialists you will need to give each specialist their own repository, if only as an 'archive' against mistakes on their local machine. As they'll (most likely) all be on the same server, they can share the underlying repo objects, so it won't be a real overhead.
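
If the specialists' repositories do all live on the same filesystem, `git clone --shared` is one way to borrow the central object store rather than copy it (paths below are hypothetical; note the git docs warn that `--shared` is only safe if the source repository never prunes or rewrites history):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/central"
git -C "$tmp/central" config user.email rm@example.com
git -C "$tmp/central" config user.name RM
echo 'baseline model' > "$tmp/central/model.f90"
git -C "$tmp/central" add model.f90
git -C "$tmp/central" commit -q -m 'baseline'

# Each specialist gets a repo, but it borrows objects from the central one
# (recorded in objects/info/alternates) instead of duplicating them:
git clone -q --shared "$tmp/central" "$tmp/alice"
cat "$tmp/alice/.git/objects/info/alternates"
```

A plain local `git clone` on the same filesystem hardlinks objects anyway, so even without `--shared` the per-specialist overhead is small.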

* a separate branch on a shared repository

* Something Completely Different

? I'm leaning toward separate repositories, but am wondering if there are
performance or operational details of which I'm unaware, given the
constraints. To be more specific, the implementation I currently favor is,
for each level:

1 Each coder gets a separate git repository on her/his desktop, which is on
a LAN that can ssh (and therefore run protocol=git) and https into the
clusters. Unfortunately these are mostly Windows (XP), but I'm presuming git
runs well enough on that--am I missing something? (I run debian, and am
mostly blissfully ignorant of platforms != linux.) Coders would also be free
to create repositories on the /home filesystems on our clusters (which run
RHEL 5, but may soon be moving to CentOS 6). On their repositories, coders
would be free to create branches and tags as desired.

Just be wary of letting networked directories that users can 'corrupt' act as repos. Treat such network repos with care.
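
One way to reduce that risk is to expose only *bare* repositories on the networked share, so there is no working tree for users to edit, delete, or reset by accident; a sketch with hypothetical paths:

```shell
set -e
tmp=$(mktemp -d)

# A bare repository has no checked-out files; users interact with it only
# via clone/push/fetch, never by editing the share directly.
git init -q --bare "$tmp/share/pt.git"

# A user works in a private clone and publishes via push:
git clone -q "$tmp/share/pt.git" "$tmp/work"   # warns it's empty; that's fine
git -C "$tmp/work" config user.email coder@example.com
git -C "$tmp/work" config user.name Coder
echo 'code' > "$tmp/work/f.f90"
git -C "$tmp/work" add f.f90
git -C "$tmp/work" commit -q -m 'first commit'
git -C "$tmp/work" push -q origin HEAD

ls "$tmp/share/pt.git"   # only refs/objects/etc. -- nothing editable
```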

2 Each project team lead gets a separate repository on one or both of the
clusters. (We can ssh/git between the clusters, and between the clusters and
the desktop LAN, but can neither ssh/git nor https from either cluster to
the outside world.) PT leads are also free to branch and tag at will on
their repo.

3 Each workgroup lead gets a separate repository on one or both of the
clusters. WG leads are also free to branch and tag at will on their repo.

4 The release managers would maintain a separate repository on one or both
of the clusters. Branch=master would, at any given time, hold the latest
release or integration build. Immediately before a release or integration
build is declared (only following its successful testing!), the current
contents of
branch=master would be branched with the date of the integration, or the
release number; then the contents of the current release/IB would be
committed to branch=master. RMs may also create other branches or tags to
facilitate integration and release.
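
That branch/tag dance can be sketched as follows, with hypothetical dates and version numbers (tags, being immutable labels, arguably fit the "freeze" step even better than branches):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/rm"
cd "$tmp/rm"
git config user.email rm@example.com
git config user.name RM

echo 'v1' > model.f90
git add model.f90
git commit -q -m 'integration build 2012-04-01'

# Tests passed; freeze the current state of master under a dated branch,
# and/or a release tag ...
git branch IB-2012-04-01
git tag -a v5.0 -m 'release 5.0'

# ... then commit the next integration build to the main branch.
echo 'v2' > model.f90
git commit -q -am 'integration build 2012-05-01'
git branch   # dated freeze branch plus the moving main branch
git tag      # release labels
```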

your review is appreciated, Tom Roche <>

I hope the comments help - I've so far failed at my work in a similar situation.

[*1*] Most 'configuration control' strategies are grounded in 19th-century drawing control as would have been practiced on the Titanic, with kaolin-and-linen drawings using India ink. Back in those times they could trace drawings and then create a 'blue' (negative) print from the tracing. They had to protect the one and only master drawing from all forms of damage and misguided changes. Only the most senior of draughtsmen were allowed to carefully scrape away the old line and add the new changed line. All the processes you see today are based on such DO (drawing office) practice.

But everything has changed. Perfect duplication and distribution of the master and its copies is now easy. People take copies and create new works from them, and only then ask that they be 'certified' as suitable for use. Whether it is MS Word's document compare or any of the source-code diff packages, the computers even document the changes perfectly. The review process is simply about deciding whether the new version should be accepted. But most processes still have "change request forms" to document the changes you have already made.

Git acknowledges the new working practices and provides a simple, verifiable ID for every possible configuration status being considered (the sha1 of the commit, and its DAG history). But it is a BIG paradigm shift for many folks.
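
This is easy to demonstrate in a throwaway repo (names hypothetical): any change to content or history yields a different, verifiable commit ID.

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/demo"
git -C "$tmp/demo" config user.email a@example.com
git -C "$tmp/demo" config user.name A

echo 'one' > "$tmp/demo/f"
git -C "$tmp/demo" add f
git -C "$tmp/demo" commit -q -m 'v1'
id1=$(git -C "$tmp/demo" rev-parse HEAD)   # the 40-hex-char configuration ID

# Change the content; the configuration status gets a new ID:
echo 'two' > "$tmp/demo/f"
git -C "$tmp/demo" commit -q -am 'v2'
id2=$(git -C "$tmp/demo" rev-parse HEAD)

[ "$id1" != "$id2" ] && echo "every configuration state has its own ID"
```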

You received this message because you are subscribed to the Google Groups "Git for 
human beings" group.