On Tue, 14 Mar 2000, Oki DZ wrote:
> turned off first. So it seems that having a server that stays on forever
> would be impossible, because its peripherals may well become obsolete.
> The same goes for its OS.
There is an interesting piece from the RISKS forum:
The recent America Online blackout has spurred my thinking on the subject of
decreasing the risks of software upgrades for real-time systems. I want to
highlight for your readers some very significant techniques that I perceive
to be underutilized to date and that, if developed and used widely in the
future, could hold great promise in drastically reducing the hazards of
simple software upgrades. They are inspired by a maddeningly familiar
pattern in software upgrades one might call "upgrade hell":
The fundamental difficulty we are observing in real-time software is that
the system is often designed to run only one version of the software at a
time. Designers are forced to bring the system "down" while they install new
software, which may or may not function correctly. Often they can only test
the full range of behavior or reliability of the new software by actually
installing it and running it "live". Then, if the software fails to work --
a failure that may itself be difficult to detect -- they are forced to
"down" the system again and reinstall the old version of the software, if
such a thing is even possible (in some cases the configuration of the new
version is such that an older version cannot readily be reverted to).
This story repeats itself endlessly in many diverse software applications,
from very large distributed systems down to individual PC upgrades.
In pondering this I came to some observations.
1. Many designers currently assume that new versions of software will be
   "plug-and-play" compatible with older versions.
2. Systems are designed to run one version of software at a time.
3. A system has to be inactive during transitions between versions.
4. Upgrades are only occasional, and the downtime due to them is acceptable.
These basic features of software and hardware interplay, despite their wide
adherence, are not in fact "carved in stone". Could we imagine a directly
contrasting system in which they are fundamentally different?
1. Let us assume that new software is not necessarily compatible with
   previous versions even where it should be, despite our best attempts to
   make it so. In fact, let us assume that humans are notoriously fallible
   in creating such a guarantee and that such a guarantee cannot
   realistically be achieved.
2. Let us imagine a system in which multiple versions of software (at least
   two) can be running simultaneously.
3. Let us imagine a system that "stays running" even during software version
   upgrades.
4. Let us assume upgrades are periodic and inevitable, and that ideally the
   system would "stay running" even throughout an upgrade.
The above assumptions lead to some attractive properties, which I will
describe. One feature is similar to the way drives can be configured to
"mirror" each other, such that if either fails the other takes over
seamlessly and the bad one is "flagged" for replacement. Imagine now that
computations themselves are "mirrored" in the hardware, such that two
versions of software run concurrently, and the software checks itself for
mismatches between the results of the computations where they are supposed
to be compatible (this can be done at many different scales of granularity,
at the discretion of the designer). The software could automatically flag
situations in which the new code is not functioning properly, even while
running the old version.
What we have is a sort of "shadow computation" going on behind the scenes.
When a designer wants to run a new version of software, he could "shadow" it
behind the currently running software to test its reliability without
actually committing to running it.
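The shadowing idea above can be sketched in a few lines. This is a minimal
illustration rather than any particular system's mechanism: `shadowed`,
`price_v1`, and `price_v2` are hypothetical names, and the old version
always determines the returned result while any disagreement with the new
version is only logged.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")


def shadowed(old_fn, new_fn, on_mismatch=None):
    """Run new_fn as a 'shadow' of old_fn: the old result is always
    returned, while the new result is computed alongside and compared."""
    def wrapper(*args, **kwargs):
        old_result = old_fn(*args, **kwargs)
        try:
            new_result = new_fn(*args, **kwargs)
        except Exception as exc:
            # A crash in the shadow never takes the live system down.
            log.warning("shadow raised %r for args=%r", exc, args)
            return old_result
        if new_result != old_result:
            log.warning("mismatch: old=%r new=%r args=%r",
                        old_result, new_result, args)
            if on_mismatch:
                on_mismatch(args, old_result, new_result)
        # The old version still determines the final result.
        return old_result
    return wrapper


# Hypothetical example: version 2 of a rounding routine shadows version 1.
def price_v1(cents):
    return cents // 100          # truncating division

def price_v2(cents):
    return round(cents / 100)    # subtly different rounding behavior

price = shadowed(price_v1, price_v2)
price(150)  # returns 1 and logs a mismatch (v2 would have returned 2)
```

Wrapping a live entry point this way corresponds to the "shadowing" stage:
only the logs change, never the results the callers see.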
Under this new system, the stages of a software upgrade blend together:
- There is a point where only the old version is running.
- Then the old and the new versions overlap, with the new "shadowing" the
  old but not actually determining final results.
- Then reliability is actually measured, and continues to be measured until
  gauged sufficient.
- Finally, the actual commitment to running the new version alone can be
  made.
- Additionally, keeping around old versions that can be switched to
  immediately in times of crisis would be a very powerful advantage --
  losing some new functionality but preserving basic or core functions.
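The staged lifecycle above can be modeled as a small state machine. The
sketch below is illustrative only: the class name, state names, and the two
thresholds (required sample count and tolerated mismatch rate) are assumed
parameters, not anything prescribed by the original text.

```python
class StagedUpgrade:
    """Track a shadowed upgrade through the stages described above:
    OLD_ONLY -> SHADOWING -> (COMMITTED | ROLLED_BACK)."""

    def __init__(self, required_samples=1000, max_mismatch_rate=0.0):
        self.state = "OLD_ONLY"
        self.samples = 0
        self.mismatches = 0
        self.required_samples = required_samples
        self.max_mismatch_rate = max_mismatch_rate

    def start_shadowing(self):
        """Begin running the new version behind the old one."""
        self.state = "SHADOWING"

    def record(self, matched):
        """Record one shadowed computation: did old and new agree?"""
        assert self.state == "SHADOWING"
        self.samples += 1
        if not matched:
            self.mismatches += 1

    def try_commit(self):
        """Commit only once reliability has been measured long enough
        and gauged sufficient."""
        if self.state != "SHADOWING" or self.samples < self.required_samples:
            return False
        if self.mismatches / self.samples <= self.max_mismatch_rate:
            self.state = "COMMITTED"
            return True
        return False

    def rollback(self):
        """The old version is kept around and can be switched to at once."""
        self.state = "ROLLED_BACK"
```

Note that `try_commit` refuses to decide before enough samples have
accumulated -- reliability is measured, not assumed.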
There would be vastly fewer "gotchas" in this system than in the classic
"upgrade hell" scenario I outlined above. Once the concept of different
versions is embodied within the software itself by the above principles,
rather than being considered foreign or external to the system, we have
other very powerful techniques that can be applied:
A "divide and conquer" approach can be used to isolate bad new components.
Different new components, all part of the new upgrade, can be selectively
turned "on" or "off" (but still shadowed) to find the combination of new
components that creates bad results, based on the "live" or "on-the-fly"
benchmarks of the previous software. In fact, it may become possible to
write software that actually automates the process of upgrading, in which
new versions of the components are switched on by the software itself after
passing automated reliability tests. The whole process of upgrading then
becomes streamlined and systematic and begins to transcend human
idiosyncrasies.
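The "divide and conquer" isolation step can be automated along these lines.
A hedged sketch, assuming exactly one new component is responsible for the
divergence, and assuming a hypothetical `passes` callback that reruns the
shadowed benchmark with a chosen subset of new components switched on (the
rest staying on the old code) and reports whether the results still match:

```python
def find_bad_component(components, passes):
    """Binary-search for the single new component whose activation makes
    the shadowed results diverge from the old version's results."""
    candidates = list(components)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        # If enabling only this half already fails the live benchmark,
        # the culprit is inside it; otherwise it is in the other half.
        if not passes(half):
            candidates = half
        else:
            candidates = candidates[len(candidates) // 2:]
    return candidates[0]


# Hypothetical usage: four new components, of which "billing" is broken.
new_components = ["parser", "scheduler", "billing", "logging"]
culprit = find_bad_component(
    new_components,
    lambda enabled: "billing" not in enabled,  # stand-in for a real benchmark
)
# culprit == "billing", found in O(log n) benchmark runs
```

Each `passes` call here stands in for a full shadowed benchmark run, so the
bisection finds the culprit in logarithmically few runs rather than testing
every combination.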
These new assumptions could lead to radically different software and
hardware systems with some very nice properties, potentially achieving what
by today's standards seems elusive to the point of impossibility yet
immensely desirable to the point of necessity: robust fault tolerance and
continued, uninterrupted service even during software upgrades. Of course,
the above techniques are inherently more difficult to implement, but the
cost-benefit ratio may be wholly acceptable and even desirable in many
mission-critical applications, such as utility-like services like
telecommunications, cyberspace, company transactions, etc.
One difficulty of implementing the new assumptions above is that such
changes often need to be made from the ground up, starting with hardware.
But the software and hardware industries have shown themselves to be very
adaptable to massive redesigns around new ideas and philosophies -- such as
object-oriented programming -- when these are shown to be efficacious in the
final analysis despite some initial inconvenience. I am not saying the above
alterations are appropriate for all applications. They can also be
introduced to varying degrees in different situations, ranging from mere
ease of switching between versions all the way to fully concurrent and
shadowed computation with multiple versions immediately available.
Also, I am sure your astute readers can point out many situations in which
the ideas I am outlining already exist. I am not saying they are novel.
However, I have not seen them emphasized before as a collection, as a basic
paradigm. In the same way that many designers were using OOP principles such
as encapsulation, polymorphism, etc. before they were focused into a unified
paradigm, I believe the above ideas could benefit from such focused
development, such as designing hardware, software, and languages that
explicitly embody them. Many of the ideas I outline are used in software
development pipelines and in the distinct QA/QC divisions of companies --
but I am proposing incorporating them into the machines themselves, which to
my knowledge is a novel perspective.
Actually, the root concept behind these ideas is even more general than mere
application to software. It is the idea that "the system should continue to
function even as parts of it are replaced". We see that this basic premise
can be applied to both hardware and software. It is such a basic attribute
that we crave and demand of our increasingly critical electronic
infrastructures, yet so difficult to achieve in practice. Isolated parts of
our systems today have this property -- is it the case that it is gradually
spreading to the point it may eventually encompass entire systems?
Many of these ideas came to me while considering a new protocol I am
devising called the "directed information assembly line" (DIAL), a brief
theoretical construct that supports such features, which I can forward to
any interested correspondents who send me e-mail. Some of the key
assumptions I am reexamining are those given above. I believe that relaxing
our assumptions about the reliability of humans can actually improve the
reliability of our systems. Let us start from new assumptions, including
"humans are fallible", rather than "humans approach the limit of virtual
infallibility if put under enough pressure" (such as the pressure always
associated with new versions and software upgrades).
V.Z.Nuri [EMAIL PROTECTED]
--------------------------------------------------------------------------
To unsubscribe, send an email to [EMAIL PROTECTED]
Archive information at http://www.linux.or.id/milis.php3
The list administrators can be reached via [EMAIL PROTECTED]