[Couchdb Wiki] Update of "Why" by JoanTouzet

Apache Wiki Fri, 06 Apr 2018 10:21:06 -0700

Dear wiki user,

You have subscribed to a wiki page "Couchdb Wiki" for change notification.

The page "Why" has been deleted by JoanTouzet:

https://wiki.apache.org/couchdb/Why?action=diff&rev1=2&rev2=3

Comment:
This feels like a personal blogpost, not something representative of the
CouchDB community. Sorry.

- ''This is a work in progress, I hesitate to even call it a rough draft yet.
It's put on the wiki to make collaborative edits easier.''

- With enough work, you can find a way to
[[http://en.wikipedia.org/wiki/Database_normalization|express anything
relationally]]. This is the "brilliant" insight into relational theory that
[[http://en.wikipedia.org/wiki/Edgar_F._Codd|Codd]] had and every relational
database zealot has harped on forever since. The worst of these zealot will
tell you SQL is an abomination because isn't really relational, it adheres only
to "some" of the relational principles.
[[http://www.dbdebunk.com/index.html|Some people seem to have made entire
careers out of zealotry.]] Yet SQL has been very successful for a very long
time, despite the obvious deficiency of not being "fully relational". Why? Why
is it successful if it doesn't and can't adhere strictly to these relational
principles?
-
- Because no one really gives a shit.
-
- Why should a normal person care in the least about relational purity? What
they want is a database that stores their data, tells them interesting things
about it and gives it back to them reliably.
-
- ''The reason SQL is successful is not because of how well adheres to
relational principles, but because it deviates from those principles in ways
that are convenient in real use.''
-
- On the other end of the spectrum, the opposite of the pure relational model,
you have a world of pure objects, where data is treated as unstructured,
indivisible units and more closely represents how data is actually represented
and passed around in the real world. The relatively recent object-oriented
programming paradigm is an extension of this model and in storage we've had the
concept of files since just about forever.
-
- This self contained, isolated object model, as it relates to the web, is
what's also becoming popularly known as the
[[http://en.wikipedia.org/wiki/REST|REST]] model. And as so many of its
proponents love to claim, it scales like crazy. In fact, REST, representing and
manipulating data as address chunks, is a realization and a formalization of
how most of the internet works and scales.
-
- But just like very little of our data is completely relational in nature,
very little of our data is completely distinct or unstructured. Most of our
data is has some common structure and is completely self contained, but refers
to external data and concepts.
-
- Examples of data that has characteristics of both are easy to find, for
example a restaurant receipt. A receipt is both a self contained document and
relational piece of data, its purpose is to record data and meta-data about a
financial transaction. It records definitive facts, things that are always
true, like the method of payment and the amount paid, but it also records more
conceptual things, like the name of the restaurant involved and the dishes
served. Recipes and ingredients change, and restaurants can be renamed, sold
and moved. The link between receipt and these external concepts, like the
restaurant, can be broken in a number of ways. Even the value of money changes
over time, not just with inflation, but also units of currency can be
completely revalued by governments. The information on that receipt, at any
point in time, is of unknown accuracy and verifiability. And yet despite all
these short comings, a receipt is still useful.
-
- And so the receipt data, despite some relational characteristic, is
problematic to represent "correctly" in a relational database, because there is
no absolute answer to what the meaning of each piece of data is and how it
relates to everything else. In paper form, the receipt preserves the original
restaurant information regardless what happens in the real world, it’s not just
a record of relations but of facts at a place in time. And even if the all the
things that link the receipt back to the originating restaurant change, the
receipt still has value. This is the "temporal" aspect of data and is something
the relational model doesn't handle well.
-
- But like pure relational modeling isn't very productive for many real world
problems, modeling data as isolated objects isn't terribly useful either.
Beyond acting as a fancy file server (like WebDAV), the REST model for data
storage and sharing is too simple to build real applications on, objects alone
are far too restrictive to use as a general storage model, it doesn't do enough
of the things we want and expect from our applications. So instead SQL
databases, with their abilities to slice and dice and tell us interesting
things about our data, continue to dominate.
-
- What is needed is a way to make these discrete objects be a little more
relational. To aggregate, organize and report on data that hasn't already been
decomposed into tables.
-
- This was very much the idea behind WinFS, Microsoft's spectacularly brilliant
and disastrous unifying vision for data modeling in the Windows world. The
problem was Microsoft couldn't make it work efficiently (not to mention they
seriously overreached, trying to make WinFS a solution to solve all problems,
before it ever solved a single problem).
-
- One of the reasons I think WinFS has failed is it is too rooted in relational
technology, the reporting engine is based in large part on SQL Server. MS is
far from the first to fail in this area, building object databases on top of
relational technology. In Oracle-land, attempts to reproduce the similar models
even has a name,
[[http://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-big-design-mistakes.html|Entity
Attribute Value model]], and is generally considered a very bad idea. To this
day I've never seen anyone build it on a relational store, make something
flexible, easy to use and that gives suitable performance.
-
- Even IBM has been working for years to get DB2 to serve as a persistence
layer for Notes and Domino, and they still can't get it stable, easy and
performant enough to be released gold. IBM is working on it because customers
keep saying how much they love the idea, but the reality is that it's damn hard
to build on top of a relational database. Essentially they are try to emulate
one general database and reporting model on top of another general purpose
database and reporting model, but the models conflict.
-
- The updates and usage model of CouchDb is a lockless, optimistic model that
is pretty much the opposite of how the relational model works. To edit an
object, you load it up, make your changes, then save them back to the database.
If someone else edits at the same time and they save their changes before you,
you'll get a conflict error on save. To resolved the conflict, the updated
document can be opened and edits can be applied to it and try again.
-
- This storage and access model is such that any number of hosts can be
replicating while simultaneously any number of users can be reading, editing
and deleting documents, performing lookups and reads in computed tables which
are simultaneously being updated for other users. Document writes are
serialized, except for file attachments and binary blobs, which are written
concurrently. And the database never needs to “lock” to do this.
-
- Relational databases, or more specifically SQL databases, simply don't make
good document databases, the fundamental set of compromises made is all wrong.
If you want something that performs well and is reliable, you need to start
from scratch and think carefully about the whole design. Lotus Notes did this
long ago, and over the years a lot of very talented people have been
successfully adapting and expanding that design to keep it competitive. And
Notes and Domino continue to be big money makers despite how many people are
other annoyed by the software, because it makes it simple to do the things that
a nightmares to deal with in SQL.
-
- CouchDB addresses the data storage problem from the non-relational, RESTful
point of view, and is designed and optimized in multiple, complimentary ways
for that world view of semi-structured, document oriented data. And by
providing a specific but flexible document storage, query and replication
model, it hits the sweet spot storage and reporting for things that are more
"document oriented" than relational, a huge class of problems in application
development.
-
- Just as SQL databases rarely are used for data that is entirely structured
and relational in nature, Couch databases rarely will be used for data that has
no structure, but where the interrelations and structure aren't strong and
might be broken. The CouchDB computed table reporting engine adds back
structure and organizes semi-structured, document data and while allowing it to
stay in it's preferred state.
-
- The CouchDB document and query model aggregates documents into a manageable
tables. Unlike SQL database, where your data must be carefully decomposed into
tables ahead of time, the tables on CouchDB are built on the fly and aggregates
the already existing data.
-

[Couchdb Wiki] Update of "Why" by JoanTouzet

Reply via email to