Since it's closely related to the discussion we've had about a proper
DB backend, I am forwarding this lengthy message from the PostgreSQL
development mailing list.

-------- Original Message --------
Subject: [HACKERS] Proposal: replace no-overwrite with Berkeley DB
Date: Sun, 14 May 2000 23:15:06 -0700
From: "Michael A. Olson" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]

Hi,

My name is Mike Olson.  I worked on Postgres at UC Berkeley from 1988
until 1993, when I left grad school to join Illustra.  I did much of
the access method design, and a bunch of implementation in the btree
and rtree code.  I did a lot of work in storage management, and in
the bowels of the system generally.

I haven't looked at PostgreSQL since Jolly Chen and Andrew Yu shipped
the last version from Berkeley, but I've followed the project from
a distance for several years.  My interest was piqued by last week's
announcement by Great Bridge that it would build a services company
around the Open Source project.

In the last couple of days, I've exchanged email with members of the
PostgreSQL core, and with Ned Lilly at Great Bridge.  I made a
proposal that Marc Fournier, Bruce Momjian, Tom Lane, and others
encouraged me to make to the community at large.  That's what I'm
doing now.

PostgreSQL uses the no-overwrite storage manager originally designed
at Berkeley as a part of the research project.  No-overwrite storage
was a great innovation, and led to a lot of interesting papers, but
it hasn't caught on in database systems generally.

There are several reasons for this.  No-overwrite storage offers two
features to users:

        +  Time travel, so that you can see the database state at any
           arbitrary time in history, and

        +  fast recovery, since you don't need to do log processing at
           restart time.

The first advantage isn't interesting to most people (when was the last
time you had to look at a record from 3pm last Wednesday?).  Even if
it *is* interesting, you have to run vacuum and sacrifice history from
time to time, or you'll wind up giving your entire operating budget
to Seagate for hard drives.

The second advantage doesn't really exist.  Most database systems offer
fast recovery using fairly simple techniques like periodic checkpoints.
Administrators can bound recovery time as tightly as they need to
without using no-overwrite.

And no-overwrite storage has some serious disadvantages.  You need to
keep the file recording every transaction's commit status around
forever.
Over time, the file will grow to be gigabytes in size.  We actually
had a customer at Illustra who was low on space and deleted it manually.
That's a good way to sell support contracts, but a bad way to keep
customers.  Worse, no-overwrite forces a lot of random I/O instead of
the simple sequential writes that logging systems do.  Every update has
to write the new record and update the t_xmax field of the old record
in place.

I understand that there is already some work underway to replace the
no-overwrite storage manager with a conventional system based on
write-ahead logging (WAL).  I think that that's a good idea, and I'm
glad to hear it's getting some thought.

My proposal to the core team was that they consider using Berkeley DB,
from Sleepycat Software, as the replacement storage manager for
Postgres.  I'll give you a technical summary of Berkeley DB in a
minute, but first I should give a little history.

When I was at Berkeley, my reimplementation of btrees for Postgres
actually marked the fifth time I'd had to write that facility from
scratch for some project or other.  I complained about that to my
office-mate, Margo Seltzer, who is now a professor at Harvard.  She
had a similar frustration with database libraries on AT&T-derived
versions of UNIX -- she wanted to use hashing in an application she
was writing, but the different flavors of UNIX all supported different
hash packages with different interfaces.

In our copious spare time, we wrote the first version of Berkeley DB.
It provided single-user btrees and hashing.  It was included in 4.4BSD,
the Berkeley UNIX operating system release, and got used by a ton of
different Open Source and proprietary projects.

The original Berkeley DB code was written by people who knew Postgres
and the interfaces that it required very well.  The btree code in
particular was designed to just drop in and run in Postgres, though
we never took that step, for lack of time.

In 1996, Margo and Keith Bostic (Keith was one of the key people on
the BSD project) were approached by Netscape to add transactions,
logging, locking, and recovery to Berkeley DB.  Netscape was flush
with cash from the public markets, and could pay for the work.
Sleepycat Software was born, and a bit less than a year later,
version 2.0 of Berkeley DB shipped from Sleepycat.

We're at release 3.0 on the Web site now, and 3.1 will be out RSN.
The current version of Berkeley DB has these properties:

        +  Supports four different access methods:  Hash, Btree, Recno
           (based on Btree but with logical record numbers, like heap
           TIDs, for keys), and Queue (based on Btree but optimized
           for write sharing on the head and tail of the queue).

        +  Highly scalable -- terabytes of data, hundreds of concurrent
           threads.

        +  Conventional 2-phase locking, write-ahead logging, and the
           plumbing required to support them, including platform-
           independent mutual exclusion, a shared memory region for
           locks and buffers, and so on.

        +  Industrial-strength features like support for hot backups,
           ability to survive media loss, very low administrative
           overhead, and so on.

Basically, Berkeley DB 3.0 is a high-performance transaction
processing library.  It links directly into an application program,
and the application calls through simple interfaces to manage
transactions and operate on tables.
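
To make those "simple interfaces" concrete, here's a minimal sketch of
a program using the library.  It's written against the 3.x-era C API
(exact signatures have drifted a little between releases), and the file
name is just a placeholder:

    #include <db.h>
    #include <string.h>
    #include <stdio.h>

    int
    main(void)
    {
        DB *dbp;
        DBT key, data;

        /* Create a handle and open (creating if needed) a Btree table. */
        if (db_create(&dbp, NULL, 0) != 0)
            return 1;
        if (dbp->open(dbp, "demo.db", NULL, DB_BTREE, DB_CREATE, 0664) != 0)
            return 1;

        /* Every access method moves data in and out through DBTs. */
        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = "fruit";
        key.size = sizeof("fruit");
        data.data = "apple";
        data.size = sizeof("apple");

        /* Store the pair, then read it back through the same handle. */
        if (dbp->put(dbp, NULL, &key, &data, 0) == 0) {
            memset(&data, 0, sizeof(data));
            if (dbp->get(dbp, NULL, &key, &data, 0) == 0)
                printf("%s -> %s\n", (char *)key.data, (char *)data.data);
        }

        (void)dbp->close(dbp, 0);
        return 0;
    }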

Berkeley DB is Open Source.  It's free for use in other Open Source
projects, like PostgreSQL.  If a developer wants to use it in a
proprietary application, then the developer needs to pay Sleepycat
a licensing fee -- that's how we make our living.  But Open Source
projects don't have to pay us anything.  You can download the full
package from our Web site at www.sleepycat.com.

We're very broadly deployed -- most of the commercial directory
servers, and a bunch of messaging servers, use Berkeley DB as their
backing store.  We've got a lot of customers who deploy us in
mission-critical high-throughput apps like network switches.  Most
of the big ISPs and portal sites run Berkeley DB to deliver content
over the Web or to do back-office data management like user tracking
and network status monitoring.

My proposal is, again, that PostgreSQL replace the no-overwrite
storage manager and the current crop of access methods with Berkeley
DB.  I'm familiar with both systems, and I believe you'd wind up
deleting a whole bunch of code and replacing a bunch of calls in
the executor to use Berkeley DB instead.

Wherever PostgreSQL currently creates a heap table, it could create
a Recno table in Berkeley DB.  These permit fast searches based on
record number, as the storage structure is actually a btree, but the
record numbers can be automatically generated.  PostgreSQL can
continue to maintain secondary indices in Berkeley DB exactly as it
does now -- select the indexed access method you want to use, store
the key, and use the record number (the heap tid) as the attribute
part of the record.

Berkeley DB permits you to store a single table in a single file,
but it also allows you to put more than one table in a file.  In
the case of PostgreSQL, for example, you could put the heap table
and all the secondary indices on it in a single file.
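
Here is a rough sketch of that layout.  The file and table names
(employee.db, heap, name_ndx) are made up for illustration, and I'm
glossing over DBT memory-management details, which differ a bit from
release to release:

    #include <db.h>
    #include <string.h>

    /*
     * Open the "heap" as a Recno table and a name index as a Btree,
     * both stored in the single physical file employee.db.
     */
    int
    open_relation(DB **heapp, DB **ndxp)
    {
        int ret;

        if ((ret = db_create(heapp, NULL, 0)) != 0 ||
            (ret = (*heapp)->open(*heapp, "employee.db", "heap",
                DB_RECNO, DB_CREATE, 0664)) != 0)
            return ret;
        if ((ret = db_create(ndxp, NULL, 0)) != 0 ||
            (ret = (*ndxp)->open(*ndxp, "employee.db", "name_ndx",
                DB_BTREE, DB_CREATE, 0664)) != 0)
            return ret;
        return 0;
    }

    /*
     * Insert a tuple: append it to the heap (Berkeley DB assigns the
     * record number), then index it by name, storing the record number
     * as the data half of the index entry.
     */
    int
    insert_tuple(DB *heap, DB *ndx, const char *name,
                 void *tuple, size_t tuple_len)
    {
        DBT key, data;
        db_recno_t recno = 0;
        int ret;

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = &recno;                 /* filled in by DB_APPEND */
        key.ulen = sizeof(recno);
        key.flags = DB_DBT_USERMEM;
        data.data = tuple;
        data.size = tuple_len;
        if ((ret = heap->put(heap, NULL, &key, &data, DB_APPEND)) != 0)
            return ret;

        /* Secondary index entry: key = attribute, data = record number. */
        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = (void *)name;
        key.size = strlen(name) + 1;
        data.data = &recno;
        data.size = sizeof(recno);
        return ndx->put(ndx, NULL, &key, &data, 0);
    }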

Of course, the PostgreSQL engine can update the heap table and
any secondary indices on it in a transaction, as it does at present,
to guarantee atomicity.
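
For example, an insert that touches the heap and a name index could be
wrapped in a single Berkeley DB transaction.  The sketch below uses the
environment-style method names (older releases spelled these
txn_begin(), txn_commit() and txn_abort()), and assumes the handles
come from an environment opened with transactions, locking, and logging
enabled:

    #include <db.h>

    /* Atomically apply one heap write and one index write. */
    int
    update_atomically(DB_ENV *dbenv, DB *heap, DB *ndx,
                      DBT *recno_key, DBT *tuple, DBT *name_key)
    {
        DB_TXN *txn;
        int ret;

        if ((ret = dbenv->txn_begin(dbenv, NULL, &txn, 0)) != 0)
            return ret;

        /* Both writes carry the same transaction handle... */
        if ((ret = heap->put(heap, txn, recno_key, tuple, 0)) != 0 ||
            (ret = ndx->put(ndx, txn, name_key, recno_key, 0)) != 0) {
            (void)txn->abort(txn);
            return ret;
        }

        /* ...so they become durable together, or not at all. */
        return txn->commit(txn, 0);
    }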

The one significant architectural difference between Berkeley DB and
what you do now is that no-overwrite storage management uses
multi-versioning for concurrency control (Vadim pointed this out to
me, and he's right).  Berkeley DB, by contrast, overwrites records
in place and uses two-phase locking and logging to handle concurrency.

My claim is that your users don't care about the mechanism; what they
want is reliable transaction-protected storage.  If you use Berkeley
DB, then a bunch of the hard work that PostgreSQL does now in locking,
figuring out whether a given record is visible, and access method
operation will just go away.  You'll have another, very well-tested,
commercial-grade implementation you can rely on, and it'll also be
Open Source.  You won't have MVCC, but you'll have reliable, scalable
concurrency control, and that's all that matters.

We've talked this over at Sleepycat.  We'd like to see it happen.
We wouldn't see any direct revenue from it, but we think that it
could help make PostgreSQL more broadly deployed, and we'd be able
to point to another major success story when talking to prospective
customers.

Comments are welcome.

                                        mike

----- End forwarded message -----
