It is widely acknowledged that the internet is a hostile environment.
There's a plethora of news about malware and other problems, and yet
mostly we seem to adopt a "head in the sand" approach to dealing with
these issues. At least, the software developers I have worked with
seem largely unconcerned about such things, perhaps because other
people's protective work has shielded them [so far] from the failure
modes?

Still, an ounce of prevention is worth a pound of cure. So, here are
some thoughts on how to engineer for resilience:

(1) Double entry bookkeeping.
https://en.wikipedia.org/wiki/Double-entry_bookkeeping_system

Any critical information should be stored in multiple ways, designed
so that corruption can be detected and isolated. The trick here is
that you want to isolate and pursue problems which do not make sense.
(If you are hiring a designer, implementer, or supporter of this kind
of thing, people who are fans of Agatha Christie novels might be good
fits, for example.)
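
To make the analogy concrete, here is a minimal sketch in Python (the
names and structure are hypothetical, not from any particular system)
of recording the same changes in two independent structures and then
reconciling them, so that a discrepancy becomes something you can
notice and pursue:

  from collections import defaultdict

  class Ledger:
      def __init__(self):
          self.journal = []                   # record 1: append-only log of changes
          self.balances = defaultdict(float)  # record 2: running totals

      def post(self, account, amount):
          # Every change is recorded twice, in independent structures.
          self.journal.append((account, amount))
          self.balances[account] += amount

      def reconcile(self):
          # Recompute the totals from the journal and compare; anything
          # that does not match is a problem worth isolating and pursuing.
          recomputed = defaultdict(float)
          for account, amount in self.journal:
              recomputed[account] += amount
          return {a: (self.balances[a], recomputed[a])
                  for a in set(self.balances) | set(recomputed)
                  if abs(self.balances[a] - recomputed[a]) > 1e-9}

  ledger = Ledger()
  ledger.post("cash", 100.0)
  ledger.post("cash", -30.0)
  print(ledger.reconcile())   # an empty dict means the two records agree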

(2) People skills.

We [as programmers] are accustomed to solving technical problems, but
the problems worth solving are people problems. And on the internet we
have the joy and privilege of facing international conflicts,
political conflicts, economic failures, war zone issues, and a
multitude of other forms of insanity. All at arm's length, but all of
these things are out there, lurking.

As a result, there's pressure to oversimplify (who wants to deal with
all that?), and while some of that simplification is necessary,
simplifying away relevant priorities can eat your lunch money for you.

Plus, we all make mistakes. And the ways we handle our own personal
mistakes can often help ameliorate external failure modes.

So there's a real need to actively cope with failure modes while
building meaningfully useful things for other people who are also
coping. And people skills seem crucial here.

(3) Gathering details on failures.

Any widely deployed software has to deal with gathering information on
crashes (which, in turn, requires people with some ability to digest
those crash reports). Or, if you can't make sense of someone else's
system, build your own, one that gathers information relevant to your
design process.
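
For example, in a Python program, a few lines are enough to start
capturing unhandled failures locally. This is just a sketch; the file
name and the details recorded are placeholders for whatever fits your
own design process:

  import datetime
  import platform
  import sys
  import traceback

  def crash_hook(exc_type, exc_value, exc_tb):
      # Append the traceback plus some environment context to a local
      # file, so failures leave a trail that someone can digest later.
      with open("crash-reports.log", "a") as f:
          f.write("---- %s ----\n" % datetime.datetime.now().isoformat())
          f.write("platform: %s\n" % platform.platform())
          traceback.print_exception(exc_type, exc_value, exc_tb, file=f)
      # Still report the error normally as well.
      sys.__excepthook__(exc_type, exc_value, exc_tb)

  sys.excepthook = crash_hook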

But that's all I can think of at the moment.

The most important part of this, I think, is that you need people who
are level-headed about the potential failures. Pretending they don't
happen, and/or pretending things are worse than they are, tends to get
in the way of reasonable solutions. But you also need a "working
approach" which complements your other priorities.

As concrete examples:

(1) Checksums (including cryptographic hashes) can help catch some
problems (it's worth thinking about what they do and do not catch).
A small sketch appears after this list.

(2) Apprenticeship as a design philosophy. If you are working on a
piece of software intended to benefit a professional user, spending
some time working directly for someone who is coping with the problems
you are trying to address can bring the important issues into focus.

I don't have any recent examples of (3).
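
To illustrate (1), here is a sketch of checksumming a file with a
cryptographic hash in Python. Note what it does and does not give
you: a mismatch tells you that something changed, but not which copy
is good, and it only helps if the recorded digest itself is stored
safely (or signed).

  import hashlib

  def sha256_of(path):
      # Hash the file in chunks so large files need not fit in memory.
      h = hashlib.sha256()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(65536), b""):
              h.update(chunk)
      return h.hexdigest()

  # Record the digest while the file is known to be good; later,
  # recompute and compare to detect silent corruption.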

This is motivated by various ongoing failures I've been observing on
some of the machines I work with. The failures themselves do not make
sense, and no one else seems to report having similar problems. I do
not know what to do about such things, except to encourage people to
try to build for resilience against failures.

That's all, for now.

Thanks,

-- 
Raul