Hi all, this topic is of great interest for me too.

Me and a couple of friends we have been developing a brand new CMS
system using the .NET framework. We are using a relational database to
store and manage metadata. We had a couple of brainstorm around how to
store metadata effectively. We all had already experience with multiple
types of storage systems. Our main concern was:

1) Metadata query performance 
2) Metadata updates performance
3) Minimal metadata replication
4) Keeping up with rapid schema updates
5) Dynamic query schema generation (Dynamic SQL and XPath)

Although it is not the most perfect solution for all problems we came up
with this:

We classified metadata usage patterns in to two classes:

1) Metadata around document structure (DocBook Schema, etc)
2) Metadata around structured taxonomies or ontologisms (Dublin Core)

We decided to map attributes in structured taxonomies to table fields.
This takes the most out of SQL capabilities for queries over metadata
that including the ones that makes reference to metadata of other types
of content (straight forward mapping)

We decided to metadata around document structure into a field in the
table. 

Example (using Tony example):

        XML for Company Employees (HR):
        <Company Name="Acme Corp">
                <Department Name="GR">
                        <Director Name="Bill">
                        <Employee Name="Carl" Age="30">
                                <htmlDesc>blab la bla<htmlDesc>
                        </Employee>
                        ....
                </Department>
                ....
        </Company>

        It seams that this info is quite hierarchical but it is not. In
fact it is equivalent to:

        A Relational Mapping:

        Company

        Id      Column Type  Column Name
        1       int              Age
        2       nvarchar(30) Name       
        3     nvarchar(30) CompanyName 
        4       varchar  htmlDesc
        5     varchat(30)  Director
        6       varchar(30)  Department

        * This does not comply with 2nd normal form although the XML
might be seen as 2nd normal form compliant. The correct relational
mapping would be with 3 tables - [Company], [Department] and [Employee].
Notice, although the XML Schema for the example above seem much more
flexible, the fact is that it is much more simple to add attributes
locally on a table ([Employee] for instance) then in the XML Schema (try
it in a complex XML Schema, and then "debug" the problem). 

In this example we have both types of metadata. The field htmlDesc is on
class 2) and the rest is on class 1).

To cope with schema changes we are building a Content Model Editor
(defining meta data attributes) that "automagically" changes the
underlying tables when necessary. It does make the changes immediately;
in fact it stores the schema in an intermediary internal XML file and
stores it in a table. The interesting thing is that the table where the
XML file (with the shena definition) is stored was also defined with the
Content Model Editor (it has its own schema hold on a internal file that
looks exactly like a schema for any other content), so it inherits all
the benefits of the storage system managed by the CMS (revision ,
workflow, security, etc etc).

We tackle multi values in the following manner. Instead of looking at
multi values as a needed fact, we have analyzed their usage patterns
within the scope of CMS. Quite often multi values are used to define
semantic categories (World(Region(Continent(Country(Country
Region(City))).

When querying this is a simple conjuction:

Look for companies in:

[Region] AND [Continent] AND [Country] AND [Country Region]

To this, we decided to implement multi values in the following manner.

For each node in the hierarchy (Facet) we add an attribute to the table.
The problem of this approach is that we loose some logic (City is within
Country Region etc etc) in the data model itself (RDBMS). But this logic
is maintained by the CMS in the following manner.

Each field in the DB is prefixed with MVP_ (Multi Value Part), followed
by the name of the taxonomy for the MV field, and then the actual name
of the node:

MVP_GEO_REGION
MVP_GEO_CONTINENT
MVP_GEO_COUNTRY
....

Usually users will input multi valued using a folder based interface
rather doing it by writing, so the logic can be constrained by the CMS
and presented to the user according to its constraints.

One can argue again that the table will not be in the 2nd normal form,
so it not. But in fact we do better. Each MVP (Multi Value Part) has its
own table. We only store ID's within the fields of the record.

This sounds complex to maintain but Content Model Editor, hides all this
complexity for the user and the developer.

In this scheme queries are really fast over structured metadata (both
multi valued and single valued). Furthermore, we don't loose the
benefits of relational modeling such as the ability to separate data
into multiple related atomic information units irrespective of
representation. Also we don't loose the performance benefits offered by
the RDBMS query engine, not to mention performance tuning facilities.

The approach to multi values is really powerful. Consider not only one
multi valued field such as GEO but multiple such as Political Map (ONU,
UN etc). One can easily produce queries that join multiple hierarchies
very easily.

For instance "Get all companies that are in settled in Europe but not in
EU".

MV_GEO_CONTINENT = "Europe" AND NOT MV_POLITICAL_ORG="EU"

Compare this with pure XPath :)

Hope it helped.

Best regards,

Nuno Lopes

PS: The secret of a good solution is in hiding complexity from the user
not avoiding it.

--
http://cms-list.org/
more signal, less noise.

Reply via email to