Now I am feeling the need for a personal semistructured store more than ever. This is a searchable database that supports "data first, structure later" <http://www.betaversion.org/~stefano/linotype/news/93/> (i.e. you can add a schema incrementally to existing data entered without one, or with a very primitive one), but supports enough structure to render web pages and things like that. Here are some options:
- RDF/Sniki-style: label edges in a directed graph with nodes, which have names. Define tables with Sniki-style queries. Add field metadata with edges from the field label node. Queries are different from edges; perhaps stored in nodes, perhaps defined in some manner by a local graph structure. Adding data to a query result is the most convenient way to add it. - UnQL-style: label edges in a directed graph with strings; nodes are anonymous; define queries by replacing regular expressions of edges with edges; display tables perhaps by going two levels deep from a particular node. (OEM, object exchange model, is essentially identical to this.) - Python/Perl/JavaScript-style: nodes are anonymous, but edges are uniquely labeled within a node with strings, and some nodes are special primitive types, like strings. Some edges are 'virtual' and computed on demand by arbitrary code that walks the neighborhood. There are nodes that are lists. - askSam-style: nodes are small text files, with a syntactic convention to specify fields. Queries are full-text searches, perhaps limited to certain fields; reports can extract values of certain fields to get tabular output. - XML-style: data is stored in a single hierarchy in which the children of each node are ordered, and each node is labeled with an element type and possibly some attributes; XPath or XQuery is the query language. Maybe you have hyperlinks by id to other nodes. - Wiki-style: a node is a small text file with a name; text formatted in a certain way inside the file represents a reference to another file. Very similar to AskSam-style, but existing Wikis either don't have a facility for fields with values or don't use it much. - wowbarbts-style: just like Wiki-style, but the node is a set of name-value pairs, and each value is potentially a small text file (which may contain references to other nodes.) Past versions are logged. - Prolog-style: data is viewed in a set of relations of fixed arity; any relation may contain both data items directly entered there by the user as well as data items "inferred", i.e. produced as query results, from other relations, and there might actually be an infinite number of such other data items. The data items may be atomic (i.e. strings) or node-labeled trees of atoms. - Lotus-Agenda-style: there's an implication hierarchy of categories (with the universe at the top), and any data item (a short string of text) may belong to any subset of the hierarchy, subject to a few validity constraints. Sets of categories within the tree form column values in reporting screens, categories form section headers in reporting screens, and the data items themselves are always displayed on reporting screens, which are also where you add new data items. In a few cases, belonging to a category may have associated with it some numeric or date value. Most category memberships are inferred from the text content of the data item, according to rules attached to the categories --- e.g. categorize anything containing the word "car" or "automotive" in the "automotive" category. - Excel-style: you have a big rectangular grid of data values; you can automatically hide rows in some subset that don't meet your criteria, and you have "quick filters" that give you menus of possible data values. Also you can hide whatever columns you want. Really simple and not very powerful but apparently sufficient for many people's uses. - mutated-Agenda-style: you have a big hierarchy of data items, each labeled either with a string or a hyperlink to some other data item. Category membership is denoted by a child labeled with a hyperlink to the category. Reports/queries can optionally be run over just the descendants or children of a single node, minimizing the global-data-store-interference problem observed with several of these models. - rumor-oriented style: data items are sets of name-value pairs; you can't modify old items, but only add new ones, which closely resemble input events; queries use the standard relational operations on some set of relations, one of which is the entire set of data items, and are evaluated in some namespace where you probably have defined most of the other names with standard queries. By evaluating a query or a set of queries in a different namespace, you can get effects like undo, deletion, update, or the elimination of all actions from a particular IP address. Using relational operations provides sequence-independence; the rumorset is not inherently sequenced. Must use lazily-updated materialized views for most queries in order to achieve reasonable performance. - zzstructure-style: each data item is a short string or other similar typed object; they are connected along 'axes', which have a linearity constraint --- if X follows Y along axis A, then nothing else can follow Y along that axis, and X must also precede Y along A. Tabular display is achieved by mapping two of the axes onto the screen's dimensions. You could probably do reasonable queries with regular expressions over axis directions and node tests, sort of like XPath or XQL. However, Ted Nelson is a patent pirate who wishes that such projects would stop, and he may have the legal means to kill them <http://lists.gnu.org/archive/html/gzz-dev/2003-02/msg00042.html> - del.icio.us-style: similar to Agenda, but without a hierarchy among categories and without any automatic categorization. - unix-filesystem-style: nodes are either strings, string-indexed dictionaries of nodes, or transparent hyperlinks to other nodes. Queries (tree regular expressions on path traversals with full-text search, or XPath, would probably suffice) produce lists of node pathnames, perhaps with associated excerpts. (It would be interesting to see full-fledged tools for viewing real Unix filesystems this way.) To essentially any of these styles, you could add any of the following features: - full-text search - incremental, interactive query result display (note that this probably implies that the query syntax should put field names after the field values being searched for; this seems obvious once stated, but does not seem to be a property of any existing query language) - arithmetic on fields (for queries and reports) - computed attributes (similar to arithmetic on fields, but perhaps more general, including query results) - imperative programming - Agenda-style automatic categorization - RDF-style namespacing - optimistically synchronized transactions - or at least locking - or at least some minimal atomic update - access to old versions of the data - indexing with Nilesh Dalvi's partial match index <ftp://ftp.cs.washington.edu/tr/2004/01/UW-CSE-04-01-01.pdf> and probably some other features I haven't thought of yet. These features may actually matter as much as the underlying data representation, and Excel shows that a nice UI for the simplest tasks compensates for many weaknesses. Into most of them, you could probably hack in some variant of the old Notes synchronization approach, recently codified by Ray Ozzie at Microsoft as Simple Sharing Extensions <http://msdn.microsoft.com/xml/rss/sse>; the rumor-oriented style <http://lists.canonical.org/pipermail/kragen-tol/2004-January/000749.html> gives you stronger guarantees about synchronization, at the cost of unbounded storage and unpredictable performance. Thoughts I've had on this problem over the years ------------------------------------------------ <http://del.icio.us/kragen/semistructured-data> 38 relevant web pages and my brief comments on them over the last year <http://lists.canonical.org/pipermail/kragen-tol/1999-August/000466.html> XML editors (how to make XML editing convenient so you can stand to take notes in it) <http://lists.canonical.org/pipermail/kragen-tol/1999-November/000486.html> XML query languages (abortive notes on a better path language for XML) <http://lists.canonical.org/pipermail/kragen-tol/2000-May/000580.html> patricia extensions (groping toward the idea I finally understood in the NFA-DFA post below) <http://lists.canonical.org/pipermail/kragen-tol/2000-September/000633.html> AskSam & outliners (sorry about the terrible formatting; will fix it sometime) <http://lists.canonical.org/pipermail/kragen-tol/2002-February/000689.html> grep on RFC-822 headers and stuff (a previous post expressing a wish for a semistructured data store, with six things to store in it) <http://lists.canonical.org/pipermail/kragen-discuss/2002-February/000750.html> patricia indexes on directed graphs (groping more toward the idea I finally understood in the NFA-DFA post) <http://lists.canonical.org/pipermail/kragen-tol/2003-January/000737.html> mvfs --- multivalued functions --- and query languages (me groping toward an understanding of semistructured query optimization) <http://lists.canonical.org/pipermail/kragen-hacks/2004-January/000384.html> viewing filesystems as filterable outlines (very rough initial prototype of a incremental full-text search UI) <http://lists.canonical.org/pipermail/kragen-tol/2004-February/000752.html> Patricia, UnQL, and NFA to DFA conversion (how to build a really large index of a graph to make path regexes fast) <http://lists.canonical.org/pipermail/kragen-tol/2005-October/000795.html> optimizing SQL in web apps by waiting until the last minute (proposing how to make web apps on a triple store fast) <http://lists.canonical.org/pipermail/kragen-hacks/2005-November/000420.html> semistructured data land: sorting paragraphs by aisle number <http://lists.canonical.org/pipermail/kragen-tol/2005-November/000809.html> folkschemas and semistructured data (how to make it easy to select field names and values) Other related stuff ------------------- wowbarbts embodies some of these ideas as well; it would be nice if it could end up getting published too, but I wouldn't be surprised if circumstances prevented that. The KB editor we used at Alpiri also embodied some of these ideas, but I didn't write it. Last year, a prominent XQuery researcher recommended Galax as the best available open-source XQuery implementation.