Hi all, this topic is of great interest for me too. Me and a couple of friends we have been developing a brand new CMS system using the .NET framework. We are using a relational database to store and manage metadata. We had a couple of brainstorm around how to store metadata effectively. We all had already experience with multiple types of storage systems. Our main concern was:
1) Metadata query performance 2) Metadata updates performance 3) Minimal metadata replication 4) Keeping up with rapid schema updates 5) Dynamic query schema generation (Dynamic SQL and XPath) Although it is not the most perfect solution for all problems we came up with this: We classified metadata usage patterns in to two classes: 1) Metadata around document structure (DocBook Schema, etc) 2) Metadata around structured taxonomies or ontologisms (Dublin Core) We decided to map attributes in structured taxonomies to table fields. This takes the most out of SQL capabilities for queries over metadata that including the ones that makes reference to metadata of other types of content (straight forward mapping) We decided to metadata around document structure into a field in the table. Example (using Tony example): XML for Company Employees (HR): <Company Name="Acme Corp"> <Department Name="GR"> <Director Name="Bill"> <Employee Name="Carl" Age="30"> <htmlDesc>blab la bla<htmlDesc> </Employee> .... </Department> .... </Company> It seams that this info is quite hierarchical but it is not. In fact it is equivalent to: A Relational Mapping: Company Id Column Type Column Name 1 int Age 2 nvarchar(30) Name 3 nvarchar(30) CompanyName 4 varchar htmlDesc 5 varchat(30) Director 6 varchar(30) Department * This does not comply with 2nd normal form although the XML might be seen as 2nd normal form compliant. The correct relational mapping would be with 3 tables - [Company], [Department] and [Employee]. Notice, although the XML Schema for the example above seem much more flexible, the fact is that it is much more simple to add attributes locally on a table ([Employee] for instance) then in the XML Schema (try it in a complex XML Schema, and then "debug" the problem). In this example we have both types of metadata. The field htmlDesc is on class 2) and the rest is on class 1). To cope with schema changes we are building a Content Model Editor (defining meta data attributes) that "automagically" changes the underlying tables when necessary. It does make the changes immediately; in fact it stores the schema in an intermediary internal XML file and stores it in a table. The interesting thing is that the table where the XML file (with the shena definition) is stored was also defined with the Content Model Editor (it has its own schema hold on a internal file that looks exactly like a schema for any other content), so it inherits all the benefits of the storage system managed by the CMS (revision , workflow, security, etc etc). We tackle multi values in the following manner. Instead of looking at multi values as a needed fact, we have analyzed their usage patterns within the scope of CMS. Quite often multi values are used to define semantic categories (World(Region(Continent(Country(Country Region(City))). When querying this is a simple conjuction: Look for companies in: [Region] AND [Continent] AND [Country] AND [Country Region] To this, we decided to implement multi values in the following manner. For each node in the hierarchy (Facet) we add an attribute to the table. The problem of this approach is that we loose some logic (City is within Country Region etc etc) in the data model itself (RDBMS). But this logic is maintained by the CMS in the following manner. Each field in the DB is prefixed with MVP_ (Multi Value Part), followed by the name of the taxonomy for the MV field, and then the actual name of the node: MVP_GEO_REGION MVP_GEO_CONTINENT MVP_GEO_COUNTRY .... Usually users will input multi valued using a folder based interface rather doing it by writing, so the logic can be constrained by the CMS and presented to the user according to its constraints. One can argue again that the table will not be in the 2nd normal form, so it not. But in fact we do better. Each MVP (Multi Value Part) has its own table. We only store ID's within the fields of the record. This sounds complex to maintain but Content Model Editor, hides all this complexity for the user and the developer. In this scheme queries are really fast over structured metadata (both multi valued and single valued). Furthermore, we don't loose the benefits of relational modeling such as the ability to separate data into multiple related atomic information units irrespective of representation. Also we don't loose the performance benefits offered by the RDBMS query engine, not to mention performance tuning facilities. The approach to multi values is really powerful. Consider not only one multi valued field such as GEO but multiple such as Political Map (ONU, UN etc). One can easily produce queries that join multiple hierarchies very easily. For instance "Get all companies that are in settled in Europe but not in EU". MV_GEO_CONTINENT = "Europe" AND NOT MV_POLITICAL_ORG="EU" Compare this with pure XPath :) Hope it helped. Best regards, Nuno Lopes PS: The secret of a good solution is in hiding complexity from the user not avoiding it. -- http://cms-list.org/ more signal, less noise.