Disclaimer: these are not only random thoughts, they are also strong opinions. I've been wrong in the past and might be wrong here too. If you think so, please make yourself heard: I'd love to find out I'm wrong sooner rather than later, at least before we spend a bunch of energy and time implementing something that isn't useful.
This RT is, in fact, a way to trigger discussion before people start writing something that might end up being more harmful than useful (especially in the XUpdate direction, which I consider a bad direction, as my previous messages outlined).

- o -

Let me start by outlining a few concepts:

1) I believe that XML databases are useful for semi-structured data only. If you have data that fits into a relational DBMS, help yourself: use your relational DBMS. Period. Using a native XML DB to store relational data, or data that could easily be described by a simple and effective relational model, is, IMHO, not only wrong: it's technological suicide.

Documents, reports, books, articles, vector graphics, multimedia interactive animations, 3d world descriptors, math formulas and chemical formulas are all examples of semi-structured data. Genetic footprints, topic maps, invoices and stock quotes are all examples of fully structured data, even if very well described by XML markup. Don't let the syntax fool you: it's the schema that counts. If the schema describes a fixed set of elements, no matter how complex, you should use relational mapping. Otherwise, and only otherwise (but this is much less common than it first appears), should you enter native XML DB land. When you hit the wall on a native XML DB and somebody shows you how much faster, simpler and more elegant a relational mapping would have been, don't tell me I didn't warn you.

- o -

Ok, let us suppose we have a bunch of semi-structured data that we want to store someplace and be able to retrieve later, possibly using some powerful XML-based query language (such as XQuery) that goes as deep as single-node granularity (which is something an RDBMS can hardly optimize). Let's define this database environment of my dreams:

1) it should be possible to see it as a single persistent tree of DOM nodes.
2) it should contain node-level version information and should provide a tagging concept (here, think of a parallel between CVS files and these DOM nodes).
3) it should be namespace friendly (every node should have a namespace where it is meaningful).
4) it should be ID friendly (document IDs should be virtualized, since the XML spec says that you can't have more than one element with the same ID).
5) it should provide XQuery capabilities.
6) it should provide fine-grained access control.
7) it should provide the ability to store binary objects as well (encoded as CDATA sections or xlinked).

Think of a tree-shaped XML-only CVS with granularity down to the single node and powerful searching capabilities. This is what I dream to see Apache XIndice become.

- o -

Now we have the data and we have the database: how do we insert this data into it? Here is a list of my requirements for such an operation:

1) it should be as trivial and intuitive as possible. In theory, it should be as easy as saving a file on a file system.
2) it should support the complete XML infoset, and thus should not limit XML functionality.
3) it should not mix concerns: inserting data is *separate* from querying data. Even transforming existing data is another concern.

One possible way of doing this is to provide a virtual file system view of the database. I've seen two commercial products that did this:

1) Oracle Portal-2-Go (now dead) used a virtual FTP server on top of the persistent DOM implementation. I've done consulting for them in Sweden (which probably resulted in the ultimate project death) and they showed me writing a document using Emacs over FTP, getting it validated on the fly and stored into the DB (guess what DB they used :), so they were forced to implement a very abstract child-parent relationship model which resulted in tables with millions of entries. The FTP server alone was some 50000 lines of Java code. It was 1999.
They still don't know what to do (I bet you they are going to wrap Cocoon and sell it once it's recognized as mainstream, just like they did with JServ on their app server). They used to be a research-oriented software company. Oh well.

2) Software AG Tamino. They wrapped Tomcat and its WebDAV servlet on top of it, so they now provide WebDAV support. I've never tried it myself nor had any contact with them, so I can't really judge their implementation.

This summer, when I was thinking about the "global CMS", I thought that the WebDAV view was very cool, since it provided a nice metaphor for people to work with and a simple interface to implement (almost all OSes provide some WebDAV functionality and I expect this to grow even more). While I still think a WebDAV interface over such a DB could be useful, I came to the conclusion that it is still not easy enough for the majority of the users of such a CMS. See the other inline editor thread: people ask the CMS for the "what", yet they are forced to do the "what->where" mapping themselves. For WebDAV, we could make this mapping as trivial as possible, but we can't remove the freedom to choose which folder to save your stuff in. It's not obvious for technicians used to this kind of "what->where" reasoning to see this as a problem, but it can be shown that the disk metaphor is extremely poor for storage systems which require solid positioning contracts. Why? Well, because you break the first law of usability: never give your users more freedom than they need to get their job done. If you are a system administrator shuffling things around, you *need* that much freedom. But if you are a content writer you don't, and placing the right content in the wrong location could ruin the entire effort, since this data would not be visible in the CMS. Sure, even power users make mistakes, so access control should be paired with on-the-fly validation attached to that folder-behaving node, which could reject invalid content.
But some users should not be given the freedom to choose where to store stuff. Moreover: if we take into account versioning, revisioning and workflow control, a file system metaphor is ultimately poor. Having a WebDAV view might be useful for hard-core power users, but we must come up with something else for a good CMS.

- o -

I came to the conclusion that what we need is an API; then you design the application that uses the database features as you like. This goes with the 'toolkit' approach, rather than giving you a pre-packaged solution. It's the "framework" approach, the one I like best (in case you didn't already know :)

Let's write some pseudocode (note: I don't know the XML:DB API just yet; I'm making this up entirely to show the concept):

  try {
      Database db = DatabaseDiscovery.lookup("host", "mydb", "username", "password");
      Location location = db.locate("/news/europe/italy/sport/football/");
      OutputStream os = location.getOutputStream("log message");
      ...
  } catch (AccessDeniedException e) {
      ...
  } catch (ValidityException e) {
      ...
  } catch (WellformednessException e) {
      ...
  } catch (DatabaseException e) {
      ...
  }

Here are a few problems with this approach:

1) lack of namespace support in node location: if we want to locate namespaced nodes, for example

  db.locate("/news/geo:europe/geo:italy/sport/football/");

we have to indicate the prefix->uri mapping. A possibility is:

  db.setNamespace("geo", "http://www.geography.org/...");
  db.locate("/news/geo:europe/geo:italy/sport/football/");

But I have the feeling this is getting into FS territory, since the 'container nodes' should all belong to a special namespace in order to allow a simple and valid file system abstraction on top of it:

  <db:database xmlns:db="...">
    <db:news>
      <db:europe>
        <db:italy>
          <db:sport>
            <db:football>
              <news:news xmlns:news="..." date="20010223">
                ...
              </news:news>
            </db:football>
          </db:sport>
        </db:italy>
      </db:europe>
    </db:news>
  </db:database>

Of course, to simplify usage, the db.locate() method could automatically use the DB namespace to locate nodes, or use internal indexes directly to get to the requested location. With this, we could have a simple yet very solid way to discriminate between "data nodes" (the file system equivalent of files) and "location nodes" (the file system equivalent of folders). This would also turn the location path into a normal path instead of an XPath, since we wouldn't need that functionality for data insertion (at least, I can't see any good reason to have it).

2) lack of inserting action indication

Suppose we use the above concept to separate db: nodes from data nodes; then we must have a way to indicate "how" the data is inserted. We have a few choices:

  a) the element is prepended
  b) the element is overwritten
  c) the element is appended

Note: since we have revisioning, overwriting actually means storing a different version on top. Data should *never* be removed from the database (as in CVS).

I think that prepending/appending doesn't make sense at all: you shouldn't count on the cardinal location of your element at retrieval time, so the API should not give you the ability to choose what to do with it. Element location makes sense for "inter-document" updates, but I think that concept is broken very early in the design: in order to come up with an XUpdate-like document, you need the original document and the changed document, create an XUpdate diff, submit it, and have the DB handle the changes. XUpdate would make sense if diffs could be generated without information on the previous data, but this is almost *never* the case, so I think it's much saner to insert full documents and let the database handle the overwriting/appending depending on the specified inserting action.

So, suppose we have

  <db:database xmlns:db="...">
    <db:news>
      <db:europe>
        <db:italy>
          <db:sport>
            <db:football>
              <news:news xmlns:news="..." date="20010223">
                ...
              </news:news>
            </db:football>
          </db:sport>
        </db:italy>
      </db:europe>
    </db:news>
  </db:database>

then we have

  <news xmlns="..." date="20010410">
    ...
  </news>

and we want to append it as another news item. We have two choices:

1) configure the DB with which ID attribute is to be expected for this location (XMLSchema already provides some functionality for this)
2) explicitly state it from the API.

The first solution fully separates concerns but makes DB configuration critical: setting the wrong ID completely breaks the system and we might end up with a thousand versions of the same news item, instead of a thousand news items with one version each. This solution doesn't require any change in the above code: it's the DB that checks whether the inserted news item has the same ID (the file system equivalent of the file name) or not. If the ID is the same, a new version is created on top (like CVS does); if no such ID is already present under that node, the node is appended.

The second solution sounds easier, but mixes concerns, since the programmer is now responsible for driving the inserting behavior of that DB location. Code would be something like

  Database db = DatabaseDiscovery.lookup("host", "mydb", "username", "password");
  Location location = db.locate("/news/europe/italy/sport/football/");
  location.setAction("insert|overwrite");
  OutputStream os = location.getOutputStream("log message");

I far prefer the first, also because it makes it easier to implement as a Source, since the resulting URI is easier:

  db:username:password//host/mydb/news/europe/italy/sport/football/?log

- o -

Let me sum up the resulting inserting behavior:

1) the DB is composed of db nodes (metadata) and user nodes (data); namespaces separate the two.
2) there is the ability to insert whole documents only, and only descending from a db node. This completely removes the need for XUpdate-like languages.
3) the DB uses db:path+ID to discriminate between documents.
4) the DB is capable of managing revisions of entire documents, probably by saving diffs instead of full trees (but this is an implementation detail).

One last thing is missing: in order to allow workflow management, we should include in the root node of any document some namespaced metadata that indicates the status of the document in the workflow. A DB dump could be something like this:

[internal view]

  <database xmlns="path-namespace" xmlns:db="db-namespace">
    <articles>
      ...
      <db:versions db:ID="SM - 20010223 - My Article">
        <db:version db:number="1.0" db:status="published">
          <article xmlns="...">
            <author name="Stefano Mazzocchi" id="SM"/>
            <title>My Article</title>
            <body>
              <para>...</para>
              ...
            </body>
          </article>
        </db:version>
        <db:version db:number="1.1" db:status="pending">
          <article xmlns="...">
            <author name="Stefano Mazzocchi" id="SM"/>
            <title>My Article</title>
            <body>
              <para>...</para>
              ...
            </body>
          </article>
          <db:comment date="20010225" by="SM">
            <xhtml:p>I changed the second section as you suggested.</xhtml:p>
          </db:comment>
          <db:comment date="20010225" by="ZZ">
            <xhtml:p>Yuck! c'mon, Stefano, you can do better than this!</xhtml:p>
          </db:comment>
        </db:version>
      </db:versions>
      ...
    </articles>
  </database>

This requires:

1) a pretty powerful way to come up with a complex-type ID for each node in an automatic way.
2) a way to send two streams, one for content and one for comments. Sure, we could use strings, but then we would miss the ability to add stuff like visual information on where the comments should be visually presented (as in Adobe Acrobat or EQuill) and to have graphic capabilities (like drawn arrows, sticky post-it notes, etc.).

NOTE: the above dump is the "internal" structure of the DB. If we ask for the "public" view of the DB (yes, we could use the view concept here as well; in fact, I got it from the DB world), we get:

[public view]

  <database xmlns="path-namespace" xmlns:db="db-namespace">
    <articles>
      ...
      <article xmlns="...">
        <author name="Stefano Mazzocchi" id="SM"/>
        <title>My Article</title>
        <body>
          <para>...</para>
          ...
        </body>
      </article>
      ...
    </articles>
  </database>

which hides all the versioning/workflow information. So, the database should provide at least two views: internal and external. The first is useful, for example, for publishing systems that want to use this information to implement content editing applications. The second is used for the public side of the publishing system (the one that hides all the workflow stuff).

Still, we haven't gotten to handling queries, but this is already hugely long and it's time to go to bed :)

Ah, BTW, many could believe that what I'm asking for is really a CMS rather than a native XML DB. Well, I think this is the only way a native XML DB is going to be of any use, so why don't we build this functionality into the DB to gain performance and ease of use?

--
Stefano Mazzocchi       One must still have chaos in oneself to be
                        able to give birth to a dancing star.
<[EMAIL PROTECTED]>                          Friedrich Nietzsche

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]
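P.S. The ID-driven inserting behavior summed up above (the DB, not the caller, decides between "new version on top" and "append a new document" by comparing IDs, and the public view exposes only the latest version of each document) can be sketched in a few lines of Java. This is a minimal in-memory illustration under my own assumptions: the class and method names are made up for this sketch and are not part of the XML:DB API or any real product; content is modeled as plain strings instead of DOM trees.

```java
import java.util.*;

// Hypothetical sketch of one "location node" (a folder-behaving db: node).
// Documents are keyed by their virtualized ID; data is never removed:
// inserting an existing ID stacks a new version on top, CVS-style.
class LocationNode {
    // ID -> versions, oldest first; insertion order of IDs is preserved
    private final Map<String, List<String>> versions = new LinkedHashMap<>();

    /** Insert a whole document; returns the version number it was given. */
    String insert(String id, String content) {
        List<String> list = versions.computeIfAbsent(id, k -> new ArrayList<>());
        list.add(content);                 // "overwrite" = new version on top
        return "1." + (list.size() - 1);   // 1.0, 1.1, 1.2, ...
    }

    /** Internal view: how many revisions a document has accumulated. */
    int revisionCount(String id) {
        return versions.getOrDefault(id, List.of()).size();
    }

    /** Public view: only the latest version of each document, no history. */
    List<String> publicView() {
        List<String> out = new ArrayList<>();
        for (List<String> list : versions.values()) {
            out.add(list.get(list.size() - 1));
        }
        return out;
    }
}
```

Note how the caller never says "insert" vs "overwrite": that is exactly the first solution above, where the db:path+ID pair alone drives the behavior, and where a wrongly configured ID would silently turn a thousand news items into a thousand versions of one item.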