(This is 2100 words, or almost 10 minutes of reading, and could
probably become 800 words productively.  I might do that later.)

Support for Disagreement
------------------------

In "Ontology is Overrated"
(http://shirky.com/writings/ontology_overrated.html), Clay Shirky
describes one advantage of tagging systems like del.icio.us as
follows:

    * Market Logic - ... we're moving towards market logic, where you
    deal with individual motivation, but group value.

    As Schachter says of del.icio.us, "Each individual categorization
    scheme is worth less than a professional categorization
    scheme. But there are many, many more of them." ...

    The other essential value of market logic is that individual
    differences don't have to be homogenized. ... with tagging, anyone
    is free to use the words he or she thinks are appropriate, without
    having to agree with anyone else about how something "should" be
    tagged. Market logic allows many distinct points of view to
    co-exist, because it allows individuals to preserve their point of
    view, even in the face of general disagreement.

So support for individual points of view amidst general disagreement
is one of the benefits of del.icio.us over dmoz or Yahoo, and it's
built into the architecture of the system --- it's not just a social
practice.  Could Wikipedia's architecture change to support divergent
points of view better?

In some cases, I think a technical advance can help solve a social
problem.  For example, crazy old Tom Lord may have been right that
better source-code version-tracking systems could let more people
collaborate productively on software, by allowing any user to
maintain their own stream of development and merge patches into it as
easily as the official maintainer does --- a feature CVS does not
support.

There are some people who respond to any such suggestions with the
aphorism that technical solutions cannot solve social problems.  I
think this aphorism contains seeds of truth and seeds of falsehood.
It is partly true, in that existing institutions and patterns of
interaction often contain internal problems that do not depend on any
technical infrastructure.  It is also partly false, however, for three
reasons.

When Technological Artifacts Can Solve Social Problems
------------------------------------------------------

First, technological artifacts --- computers, source-control systems,
whatever --- are not purely value-neutral.  They embed ways of
thinking and expectations about social interactions that reflect the
environments in which they evolved.  For example, Unix provides no
mandatory access controls, because it evolved in an environment whose
users weren't trying to prevent one another from sharing information;
the same is true of the internet.  And now the DSL and cable-modem
parts of today's internet often provide no permanent IP address users
can use to publish information, because those parts were developed
for consumers, not participants.  As a third example, Unix was
developed in an environment of extreme literacy, and consequently
many things about it are unduly difficult for people of limited
literacy.

When an institution adopts a technological artifact, it invariably
changes somewhat to accommodate the cultural expectations embedded in
the new artifact, if only by working around them.  Often this causes
some changes in the structure of the institution.  However, these
changes do not merely change the institution into a copy of the social
environment that developed the artifact, and in some cases may make
the institutions less alike.  Problems inherent in the existing
relationships of the institution usually survive the adoption of new
artifacts.  For example, many companies using Unix simply prohibit
"ordinary users" from logging into a Unix server, in order to prevent
them from sharing information with one another.  The consequence is
sometimes to increase effective controls on information-sharing.

These changes are hard to predict.  William Gibson summed this up in
his line, "The Street finds its own uses for things --- uses the
manufacturers never imagined."

Second, some social problems are simply the result of technical
problems.  For example, nuclear power plants promote centralization of
energy-generating capacity, and consequently concentrate the wealth
generated by production of energy in the hands of the small number of
people who own the power plants.  This social problem of
centralization, however, is in part the result of technical problems
with reactor safety and nuclear proliferation.

Third, new technical artifacts can support the existence of new
institutions, and those institutions may have different structures
from the existing institutions --- and they may crowd them out.  For
example, the telephone made possible single companies that operated
many factories by providing a higher-bandwidth non-market means of
coordinating, and email mailing lists support private topical
discussion groups among geographically-distributed groups of people
with a common niche interest.

So technical artifacts can create social problems, and by the same
token, solve them.  They can't solve all social problems, and the way
that existing institutions make use of the artifacts depends closely
on the details of the existing institution, so it is very difficult to
predict what will happen.

Source Control as an Example
----------------------------

The particular kind of interaction that decentralized version control
tools, such as Tom Lord's "arch", Codeville, monotone, and darcs (as
well as a proprietary system) aim to support is something like this.
Many people have versions of a piece of software; the software is
broadly similar from person to person.  Many of the people are making
changes to their local version, or "tree"; all of them select changes
from other people's versions to add to their own.  It's possible to
move changes from one "tree" to another to the extent that they
share common structure, though that structure is, of course, itself a
function of these changes over time.
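
To make the model concrete, here is a minimal sketch of two such
"trees" exchanging a change, using git (one of the decentralized
systems discussed below); the repository names "alice" and "bob" are
hypothetical.

```shell
# Alice creates her own tree and commits a first version.
git init -q alice
git -C alice config user.email alice@example.com
git -C alice config user.name Alice
echo 'first draft' > alice/notes.txt
git -C alice add notes.txt
git -C alice commit -qm 'initial version'

# Bob's clone is a complete, first-class tree of his own, with full
# version tracking -- not a mere checkout of Alice's repository.
git clone -q alice bob
git -C bob config user.email bob@example.com
git -C bob config user.name Bob
echo 'a change of my own' >> bob/notes.txt
git -C bob commit -qam 'a change made in my own tree'

# Alice merges Bob's change directly from his tree; no central
# repository mediates the exchange.
git -C alice pull -q ../bob
```

Because the two trees share common history, the merge happens
automatically; had both sides changed the same lines, either party
could resolve the conflict locally without blocking the other.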

This differs from the CVS approach, in which there's a single "trunk"
stream of development to which the maintainer or maintainers add new
versions, and other contributors (if they exist at all) contribute
by emailing patches to a maintainer, who can then commit them to the
trunk.  The other contributor may have their own local CVS repository,
but its ability to merge in changes from new "official" versions is
limited.

If an organization tries to adopt the "arch" model of the world while
using CVS, they have several choices.  

They can treat each person's work area as a separate stream of
development, and the CVS repository as yet another stream; this means
that only one of the streams has version tracking.  Worse, if people
try to exchange patches directly, they waste a lot of time resolving
spurious merge conflicts when they both try to commit the same set of
changes to the CVS repository.

They can create a separate CVS repository for each person; this
preserves version-tracking for each person, but makes getting code
changes back and forth considerably more difficult, and doesn't
eliminate the spurious merge conflicts.

They can use a separate CVS branch for each person; this makes CVS
very slow, but retains version-tracking for each person, and makes
getting code back and forth a little easier (though still dramatically
more difficult than ordinary CVS use); it still doesn't eliminate the
merge conflicts.

The nearly universal choice, with CVS, is to adopt a more centralized
model in which the CVS repository trunk is the Official Tree, even
when that is costly --- for example, when it isn't yet clear which
design choice is better, or when you have to support customers with
an old version or a very strange hardware setup.

So here we see a technical problem --- CVS's limitations, stemming
from the limited social dynamics it was built to support ---
reflecting itself in social problems.  "arch", darcs, monotone,
Codeville, git, and other decentralized version-tracking systems aim
to support a wider array of development models; in particular, they
aim to allow each person's tree to stand alone as a first-class
citizen, easily sharing its changes with other similar trees.

Imagine Wikipedia Decentralized
-------------------------------

Imagine that we applied one of these systems to Wikipedia.  We would
have several benefits: tolerance of controversy, disconnected
operation, higher availability, and potentially organizational
decentralization.

We could tolerate controversy better because Holocaust deniers would
have their own version of Wikipedia, which they could modify to their
heart's content.  This would reduce their desire to modify the
Wikipedia that everyone else reads, but it would not eliminate it.

More importantly, though, allowing everyone to modify their own copy
of Wikipedia conveniently, but share the changes with anyone who
wanted them, would reduce the need to support changes by anonymous
people on the main Wikipedia site.  Perhaps a historian, or several
historians, would undertake the task of merging together
history-related changes from many contributors, and the main Wikipedia
site would accept their changes automatically on history-related
articles --- but not history-related changes from other people.

In Linux, there's one fellow who decides what goes into the official
kernel, another fellow who tries stuff out for a while before the
first guy accepts it, and a small number of "lieutenants" who act as
gathering points for patches on particular topics, like memory
management or IDE support, so that the generalist fellows don't have
to spend as much time looking at those things --- they can just
accept them en masse.  And the lieutenants have subsystem maintainers
who perform the same function for them.  Also, there are a dozen or
so distributions that take the official kernel and apply their own
sets of patches, gathered from different sources.

All of this is mediated by public discussion on mailing lists where
people publish their patches.

Of the process of selecting these people, Linus Torvalds says (in
http://www.linuxworld.com/story/46051.htm):
    
   "It's not me or any other leader who picks them. The programmers
    are very good at selecting leaders. There's no process for making
    somebody a lieutenant. But somebody who gets things done, shows
    good taste, and has good qualities -- people just start sending
    them suggestions and patches. I didn't design it this way. This
    happens because this is the way people work. It's very natural."

This sort of structure can make it considerably easier to incorporate
desirable changes into the Official Tree without including
undesirable ones.  In Linux, perhaps the majority of proposed changes
are undesirable, but that's probably not the case for Wikipedia; so
Wikipedia would benefit from a much more streamlined user interface
for submitting and accepting patches.

Presumably Holocaust deniers would have a hard time getting their
changes accepted by anyone, and so they would have a hard time getting
other people to contribute on their server.

Right now, Wikimedia's servers are a single point of failure for all
the millions of people who benefit from the whole Wikipedia project,
and the thousands who contribute.  

Perhaps I would pull new changes from the wikipedia.org server onto my
laptop each night so I could consult Wikipedia when I was offline.  I
might make changes to my local copy of Wikipedia, and the
version-tracking system (Codeville or whatever) would keep track of
those changes.  When I got back online, I could submit them to
wikipedia.org, or the relevant subsystem maintainers, or whatever.  If
the particular server I wanted to submit my changes to was down for a
few days, it wouldn't be a big deal; I'd still have my local copy and
my local changes, and I could still share them with other people.
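
In git terms, that offline workflow might look like the following
sketch; the "wikipedia.git" repository here is hypothetical, a
stand-in for a wikipedia.org server that does not actually offer such
a thing.

```shell
# A stand-in for the central wikipedia.org repository.
git init -q --bare wikipedia.git

# Nightly, while online: fetch the latest articles onto the laptop.
git clone -q wikipedia.git laptop
git -C laptop config user.email editor@example.com
git -C laptop config user.name Editor

# Offline: edit and commit locally; the version tracker records the
# change even with no network connection.
echo 'an article edit made offline' > laptop/article.txt
git -C laptop add article.txt
git -C laptop commit -qm 'offline edit'

# Back online: submit the accumulated local changes upstream.
git -C laptop push -q origin HEAD
```

If the push fails because the server is down, nothing is lost: the
commit stays in the laptop's own history, and can be pushed later or
pulled directly by anyone else.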

Presumably other people would also run their own public mirrors of
Wikipedia.  In fact, there are already such mirrors, mostly set up by
SEO spamming scum, but right now, they don't let you edit the articles
--- what would they do with the edits?  Overwrite them with the next
version they copy from Wikipedia?  Consequently, these mirrors are
mostly not very good.  Imagine that they were actually making positive
contributions to the Wikipedia community, though: with their own
communities of contributors vetting changes, many good changes would
get made and reviewed without ever having to hit the main Wikipedia
site.

Wikipedia now has a budget, a team of system administrators, a
bandwidth bill, fund-raising drives, and banned-IP lists --- the
inevitable consequence, for now, of the operational centralization of
a service useful to all the people of the world.  This operational
centralization might happen even if the underlying software supported
a more decentralized structure, but it wouldn't need to.

It should be obvious that I think this centralization is a necessary
evil, and I hope this approach would make it an unnecessary evil.  But
kernel.org --- where people download the Linux kernel --- currently
has two Proliant DL585 quad-processor Opteron boxes with 24 GB of RAM
attached to two one-gigabit network links and a 10-terabyte disk
array.  That's maybe US$100 000 of machinery (each server is about
$40k) to serve a relatively small part of the world's population.  So
clearly the fact that copies of the kernel are all over the net
doesn't dissuade people from using the canonical site.

Credits
-------

I greatly appreciate the help of Brett C. Smith and Rohit Khare in
discussing these ideas; I also drew on the writings and speeches of
Clay Shirky, Greg Kroah-Hartman, Linus Torvalds, Joel Spolsky, Tom
Lord, and William Gibson.
