Disclaimer: these are not only random thoughts, they are also strong opinions. I've been wrong in the past and might be wrong here too. If you think so, please make yourself heard: I'd love to find out I'm wrong sooner rather than later, at least before we spend a bunch of energy and time implementing something that isn't useful.
This RT is, in fact, a way to trigger discussion before people start writing something that might end up being more harmful than useful (especially in the XUpdate direction, which I consider a bad direction, as my previous messages outlined).

- o -

Let me start by outlining a few concepts:

1) I believe that XML databases are useful for semi-structured data only. If you have data that fits into a relational DBMS, help yourself: use your relational DBMS. Period. Using a native XML DB to store relational data, or data that could easily be described by a simple and effective relational model, is, IMHO, not only wrong: it's technological suicide.

Documents, reports, books, articles, vector graphics, multimedia interactive animations, 3d world descriptors, math formulas and chemical formulas are all examples of semi-structured data. Genetic footprints, topic maps, invoices and stock quotes are all examples of fully structured data, even if very well described by XML markup. Don't let the syntax fool you: it's the schema that counts. If the schema describes a fixed set of elements, no matter how complex, you should use relational mapping. Otherwise, and only otherwise (but this is much less common than it first appears), should you enter native XML DB land. When you hit the wall on a native XML DB and somebody shows you how much faster, simpler and more elegant a relational mapping would have been, don't tell me I didn't warn you.

- o -

Ok, let us suppose we have a bunch of semi-structured data that we want to store someplace and be able to retrieve later, possibly using some powerful XML-based query language (such as XQuery) that goes as deep as single-node granularity (which is something an RDBMS can hardly optimize). Let's define this database environment of my dreams:

1) it should be possible to see it as a single persistent tree of DOM nodes.
2) it should contain node-level version information and should provide a tagging concept (here, think of a parallel between CVS files and these DOM nodes).
3) it should be namespace friendly (every node should have a namespace where it is meaningful).
4) it should be ID friendly (document IDs should be virtualized, since the XML spec says that you can't have more than one element with the same ID).
5) it should provide XQuery capabilities.
6) it should provide fine-grained access control.
7) it should provide the ability to store binary objects as well (encoded as CDATA sections or xlinked).

Think of a tree-shaped XML-only CVS with granularity down to the single node and powerful searching capabilities. This is what I dream to see Apache XIndice become.

- o -

Now we have the data and we have the database: how do we insert this data into it? Here is a list of my requirements for such an operation:

1) it should be as trivial and intuitive as possible. In theory, it should be as easy as saving a file on a file system.
2) it should support the complete XML infoset, and thus should not limit XML functionality.
3) it should not mix concerns: inserting data is *separate* from querying data. Even transforming existing data is another concern.

One possible way of doing this is to provide a virtual file system view of the database. I've seen two commercial products that did this:

1) Oracle Portal-2-Go (now dead) used a virtual FTP server on top of the persistent DOM implementation. I've done consulting for them in Sweden (which probably resulted in the ultimate project death) and they showed me writing a document using Emacs over FTP, getting it validated on the fly and stored into the DB (guess what DB they used :), so they were forced to implement a very abstract child-parent relationship model which resulted in tables with millions of entries. The FTP server alone was some 50000 lines of Java code. It was 1999.
They still don't know what to do (I bet you they are going to wrap Cocoon and sell it once it's recognized as mainstream, just like they did with JServ on their app server). They used to be a research-oriented software company. Oh well.

2) Software AG Tamino. They wrapped Tomcat and its WebDAV servlet on top of it, so they now provide WebDAV support. I've never tried it myself nor had any contact with them, so I can't really judge their implementation.

This summer, when I was thinking about the "global CMS", I thought that the WebDAV view was very cool, since it provided a nice metaphor for people to work with and a simple interface to implement (almost all OSes provide some WebDAV functionality and I expect this to grow even more). While I still think a WebDAV interface over such a DB could be useful, I came to the conclusion that it is still not easy enough for the majority of the users of such a CMS. See the other inline editor thread: people ask the CMS for the "what", yet they are forced to do the "what->where" mapping themselves. For WebDAV, we could make this mapping as trivial as possible, but we can't remove the freedom to choose which folder to save your stuff in. It's not obvious for technicians used to this kind of "what->where" reasoning to see this as a problem, but it can be shown that the disk metaphor is extremely poor for storage systems which require solid positioning contracts. Why? Well, because you break the first law of usability: never give your users more freedom than they need to get their job done. If you are a system administrator shuffling things around, you *need* that much freedom. But if you are a content writer you don't, and placing the right content in the wrong location could ruin the entire effort, since this data would not be visible in the CMS. Sure, even power users make mistakes, so access control should be paired with on-the-fly validation attached to that folder-behaving node, which could reject invalid content.
But some users should not be given the freedom to choose where to store stuff. Moreover: if we take into account versioning, revisioning and workflow control, a file system metaphor is ultimately poor. Having a WebDAV view might be useful for hard-core power users, but we must come up with something else for a good CMS.

- o -

I came to the conclusion that what we need is an API; then you design the application that uses the database features as you like. This goes with the 'toolkit' approach, rather than giving you a pre-packaged solution. It's the "framework" approach, the one I like best (in case you didn't already know :)

Let's write some pseudocode (note: I don't know the XML:DB API just yet; I'm making this up entirely to show the concept):

  try {
      Database db = DatabaseDiscovery.lookup("host", "mydb", "username", "password");
      Location location = db.locate("/news/europe/italy/sport/football/");
      OutputStream os = location.getOutputStream("log message");
      ...
  } catch (AccessDeniedException e) {
      ...
  } catch (ValidityException e) {
      ...
  } catch (WellformednessException e) {
      ...
  } catch (DatabaseException e) {
      ...
  }

Here are a few problems with this approach:

1) lack of namespace support in node location: if we want to locate namespaced nodes, for example

  db.locate("/news/geo:europe/geo:italy/sport/football/");

we have to indicate the prefix->uri mapping. A possibility is:

  db.setNamespace("geo", "http://www.geography.org/...");
  db.locate("/news/geo:europe/geo:italy/sport/football/");

But I have the feeling this is getting into FS territory, since the 'container nodes' should all belong to a special namespace in order to allow a simple and valid file system abstraction on top of it:

  <db:database xmlns:db="...">
    <db:news>
      <db:europe>
        <db:italy>
          <db:sport>
            <db:football>
              <news:news xmlns:news="..." date="20010223">
                ...
              </news:news>
            </db:football>
          </db:sport>
        </db:italy>
      </db:europe>
    </db:news>
  </db:database>

Of course, to simplify usage, the db.locate() method could automatically use the DB namespace to locate nodes, or use internal indexes directly to get to the requested location. With this, we could have a simple yet very solid way to discriminate between "data nodes" (the file system equivalent of files) and "location nodes" (the file system equivalent of folders). This would also turn the location path into a normal path instead of an XPath, since we wouldn't need that functionality for data insertion (at least, I can't see any good reason to have it).

2) lack of inserting action indication

Suppose we use the above concept to separate db: nodes from data nodes; then we must have a way to indicate "how" the data is inserted. We have a few choices:

  a) the element is prepended
  b) the element is overwritten
  c) the element is appended

Note: since we have revisioning, overwriting actually means storing a different version on top. Data should *never* be removed from the database (as in CVS).

I think that prepending/appending doesn't make sense at all: you shouldn't count on the cardinal location of your element at retrieval time, so the API should not give you the ability to choose what to do with it. Element location makes sense for "inter-document" updates, but I think that concept is broken very early in the design: in order to come up with an XUpdate-like document, you need the original document and the changed document, create an XUpdate diff, submit it, and have the DB handle the changes. XUpdate would make sense if diffs could be generated without information on the previous data, but this is almost *never* the case, so I think it's much saner to insert full documents and let the database handle the overwriting/appending depending on the specified inserting action.

So, suppose we have

  <db:database xmlns:db="...">
    <db:news>
      <db:europe>
        <db:italy>
          <db:sport>
            <db:football>
              <news:news xmlns:news="..." date="20010223">
                ...
              </news:news>
            </db:football>
          </db:sport>
        </db:italy>
      </db:europe>
    </db:news>
  </db:database>

then we have

  <news xmlns="..." date="20010410">
    ...
  </news>

and we want to append it as another news item. We have two choices:

1) configure the DB with which ID attribute is to be expected for this location (XMLSchema already provides some functionality for this)
2) explicitly state it from the API.

The first solution fully separates concerns but makes DB configuration critical: setting the wrong ID completely breaks the system and we might end up with a thousand versions of the same news item, instead of a thousand news items with one version each. This solution doesn't require any change in the above code: it's the DB that checks whether the inserted news item has the same ID (the file system equivalent of the file name) or not. If the ID is the same, a new version is created on top (like CVS does); if no such ID is already present under that node, the node is appended.

The second solution sounds easier, but mixes concerns, since the programmer is now responsible for driving the inserting behavior of that DB location. Code would be something like

  Database db = DatabaseDiscovery.lookup("host", "mydb", "username", "password");
  Location location = db.locate("/news/europe/italy/sport/football/");
  location.setAction("insert|overwrite");
  OutputStream os = location.getOutputStream("log message");

I far prefer the first, also because it makes it easier to implement as a Source, since the resulting URI is easier:

  db:username:password//host/mydb/news/europe/italy/sport/football/?log

- o -

Let me sum up the resulting inserting behavior:

1) the DB is composed of db nodes (metadata) and user nodes (data); namespaces separate the two.
2) there is the ability to insert whole documents only, and only descending from a db node. This completely removes the need for XUpdate-like languages.
3) the DB uses db:path+ID to discriminate between documents.
4) the DB is capable of managing revisions of entire documents, probably by saving diffs instead of full trees (but this is an implementation detail).

One last thing is missing: in order to allow workflow management, we should include in the root node of any document some namespaced metadata that indicates the status of the document in the workflow. A DB dump could be something like this:

[internal view]

  <database xmlns="path-namespace" xmlns:db="db-namespace">
    <articles>
      ...
      <db:versions db:ID="SM - 20010223 - My Article">
        <db:version db:number="1.0" db:status="published">
          <article xmlns="...">
            <author name="Stefano Mazzocchi" id="SM"/>
            <title>My Article</title>
            <body>
              <para>...</para>
              ...
            </body>
          </article>
        </db:version>
        <db:version db:number="1.1" db:status="pending">
          <article xmlns="...">
            <author name="Stefano Mazzocchi" id="SM"/>
            <title>My Article</title>
            <body>
              <para>...</para>
              ...
            </body>
          </article>
          <db:comment date="20010225" by="SM">
            <xhtml:p>I changed the second section as you suggested.</xhtml:p>
          </db:comment>
          <db:comment date="20010225" by="ZZ">
            <xhtml:p>Yuck! c'mon, Stefano, you can do better than this!</xhtml:p>
          </db:comment>
        </db:version>
      </db:versions>
      ...
    </articles>
  </database>

This requires:

1) a pretty powerful way to come up with a complex-type ID for each node in an automatic way.
2) a way to send two streams, one for content and one for comments. Sure, we could use strings, but then we would miss the ability to add stuff like visual information on where the comments should be visually presented (as in Adobe Acrobat or EQuill) and to have graphic capabilities (like drawn arrows, sticky post-it notes, etc.).

NOTE: the above dump is the "internal" structure of the DB. If we ask for the "public" view of the DB (yes, we could use the view concept here as well; in fact, I got it from the DB world), we get:

[public view]

  <database xmlns="path-namespace" xmlns:db="db-namespace">
    <articles>
      ...
      <article xmlns="...">
        <author name="Stefano Mazzocchi" id="SM"/>
        <title>My Article</title>
        <body>
          <para>...</para>
          ...
        </body>
      </article>
      ...
    </articles>
  </database>

which hides all the versioning/workflow information. So, the database should provide at least two views: internal and external. The first is useful, for example, for publishing systems that want to use this information to implement content editing applications. The second is used for the public side of the publishing system (the one that hides all the workflow stuff).

Still, we haven't gotten to handling queries, but this is already hugely long and it's time to go to bed :)

Ah, BTW, many could believe that what I'm asking for is really a CMS rather than a native XML DB. Well, I think this is the only way a native XML DB is going to be of any use, so why don't we build this functionality into the DB to gain performance and ease of use?

--
Stefano Mazzocchi       One must still have chaos in oneself to be
                        able to give birth to a dancing star.
<[EMAIL PROTECTED]>                          Friedrich Nietzsche

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, email: [EMAIL PROTECTED]
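P.S. The ID-driven inserting behavior summed up above (the DB, not the caller, decides between "new version on top" and "append a new document" by comparing IDs, and the public view exposes only the latest version of each document) can be sketched in a few lines of Java. This is a minimal in-memory illustration under my own assumptions: the class and method names are made up for this sketch and are not part of the XML:DB API or any real product; content is modeled as plain strings instead of DOM trees.

```java
import java.util.*;

// Hypothetical sketch of one "location node" (a folder-behaving db: node).
// Documents are keyed by their virtualized ID; data is never removed:
// inserting an existing ID stacks a new version on top, CVS-style.
class LocationNode {
    // ID -> versions, oldest first; insertion order of IDs is preserved
    private final Map<String, List<String>> versions = new LinkedHashMap<>();

    /** Insert a whole document; returns the version number it was given. */
    String insert(String id, String content) {
        List<String> list = versions.computeIfAbsent(id, k -> new ArrayList<>());
        list.add(content);                 // "overwrite" = new version on top
        return "1." + (list.size() - 1);   // 1.0, 1.1, 1.2, ...
    }

    /** Internal view: how many revisions a document has accumulated. */
    int revisionCount(String id) {
        return versions.getOrDefault(id, List.of()).size();
    }

    /** Public view: only the latest version of each document, no history. */
    List<String> publicView() {
        List<String> out = new ArrayList<>();
        for (List<String> list : versions.values()) {
            out.add(list.get(list.size() - 1));
        }
        return out;
    }
}
```

Note how the caller never says "insert" vs "overwrite": that is exactly the first solution above, where the db:path+ID pair alone drives the behavior, and where a wrongly configured ID would silently turn a thousand news items into a thousand versions of one item.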