Hi Gerd,
thanks for bringing this up again!
On 22.12.11 17:58, Gerd Schrick wrote:
Since our last Lenya-meeting in Freiburg I'm seriously thinking about a
SQL-database based implementation of the content repository (DB-content-
repo) and would like to kindly ask you for your opinion, ideas, hints...
It will only be for Lenya 2 and is still at an early draft stage.
Any honest feedback is highly appreciated.
Basic:
1) available as an alternative to the current file based repo
I think this is a very important point. IMO a straightforward migration
path is crucial to the success of the undertaking. If we can keep the
repository API untouched, i.e. the users don't have to make changes to
their client code, the chances will be much higher that the DB repo
implementation will be accepted.
2) It should be possible to choose the repository for each publication
(no switching between repositories, afterwards)
If the content structure can be clearly defined, it should be reasonably
easy to implement an import/export mechanism for either implementation,
which can be used to migrate content between the implementations. The
migration tool from Lenya 1.2 to Lenya 2.0 could be used as a basis
(traverse a content repository somewhere on the disk, write it into the
Lenya content repo using the repo API).
The idea is basically to "replace" the filesystem-based storage and put
all the content into a DB.
The overall structure of how content is managed/stored in Lenya should
not be changed (unless really necessary).
What we hope to achieve is:
1) better performance with large publications (> 10,000 documents)
Yes, although to improve performance it would probably already help
a lot to add a couple of levels to our repository tree, analogous to the
HTTPD disk cache (e.g. grouping documents that share the first n
characters of the UUID in a common directory).
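For illustration, with n = 2 the layout might look like this (UUIDs and
directory names made up):

  content/
    a1/
      a1b2c3d4-.../
      a13e57f9-.../
    b7/
      b74a9c21-.../

Each subdirectory then only holds a fraction of the documents, which
keeps directory listings fast.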
2) easy, fast and flexible way to do queries based on metadata
example: there's a metafield "categories" that stores the category keys
comma-separated, like "c2,c17,c2006,c33", and we want a link list on the
homepage that lists the latest 5 documents of a certain category
("c17"). The documents can be found anywhere in the publication.
With SQL this can easily be done with something like "SELECT uuid FROM
metatable WHERE categories LIKE '%c17,%' ORDER BY lastupdated DESC LIMIT 5"
Sure, such a case can be solved with Lucene as well, but I think that
SQL gives much more flexibility to do something like this on the fly
(maybe for an author while configuring a ContentUnit)
Lucene is also very flexible in this regard, and fairly speedy. You can
check out the tagcloud module for an example.
3) Deactivating and deleting documents takes a very long time in our
large publications due to the link checking (as far as I understand
what's going on)
I have code in my sandbox which uses Lucene for this purpose,
eliminating the performance penalty. See also
http://thread.gmane.org/gmane.comp.cms.lenya.devel/24238
The code needs some polishing, but I can commit it somewhere if there is
sufficient interest.
- I assume a DB will be much faster at finding all the items WHERE
content LIKE '%lenya-document:112344...%'
This will probably still be fairly slow, since a sequential scan over
quite a large amount of text is necessary. The approach described above
uses metadata to store outbound references; a sketch follows below. It
would work with both Lucene and an RDBMS.
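Just to make the idea concrete, an outbound-reference table in SQL terms
could look roughly like this (table and column names are invented):

  -- one row per outbound link, extracted when a document is saved
  CREATE TABLE doc_references (
      source_uuid VARCHAR(64) NOT NULL,
      language    VARCHAR(8)  NOT NULL,
      target_uuid VARCHAR(64) NOT NULL
  );
  CREATE INDEX idx_refs_target ON doc_references (target_uuid);

  -- before deactivating/deleting, find all documents linking to the target
  SELECT source_uuid, language
    FROM doc_references
   WHERE target_uuid = ?;

With an index on target_uuid this lookup stays fast regardless of the
total content size.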
4) One problem when moving to our cluster environment was the
performance of the shared filesystem (NFS) between the delivery nodes
(Lenya instances) - sharing the same DB won't be an issue
This is a good point.
4 a) fits better into modern enterprise environments
Nowadays, there is usually no dedicated filesystem space provided for
applications. Instead, each application gets its share from a NAS, which
cannot cope with Lenya's heavy filesystem load.
This makes sense.
5) Maybe clustered authoring will also become easier then?
With a shared DB it is probably easier to implement lock flags, along
the lines of the sketch below.
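A minimal sketch, assuming the documents table gets (hypothetical)
locked_by and locked_at columns:

  -- atomically acquire a lock; if 0 rows are affected, someone else holds it
  UPDATE documents
     SET locked_by = 'alice', locked_at = CURRENT_TIMESTAMP
   WHERE uuid = ? AND language = ? AND locked_by IS NULL;

  -- release the lock again
  UPDATE documents
     SET locked_by = NULL, locked_at = NULL
   WHERE uuid = ? AND language = ? AND locked_by = 'alice';

Since a single UPDATE is atomic, no separate locking infrastructure on
the filesystem would be needed.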
6) Better scalability
7) easier deployment
8) We could use the ACID functionality provided by the RDBMS instead of
our own transaction system.
The Database structure:
"documents" in this context is mainly what is curently stored in the
uuid folder with filename {language}
1) one database per publication
Why would you make it mandatory to use separate DBs for the
publications? I think this could be up to the user. Adding new
publications might be easier on the fly if all pubs share a common DB.
2) Documents table: one table for the documents in the Authoring (also
contains Trash and Archive) and one table for the documents in the Live
area
IMO one table with an area column would be sufficient.
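A minimal sketch of what I mean, with a publication column thrown in so
that several pubs could share one DB as mentioned above (column types
are assumptions):

  CREATE TABLE documents (
      publication    VARCHAR(64) NOT NULL,
      uuid           VARCHAR(64) NOT NULL,
      language       VARCHAR(8)  NOT NULL,
      area           CHAR(1)     NOT NULL, -- A(uthoring), T(rash), E (Archive), L(ive)
      workflow_state VARCHAR(32),
      last_updated   TIMESTAMP,
      doctype        VARCHAR(64),
      mimetype       VARCHAR(64),
      content_text   TEXT,
      content_blob   BLOB,
      PRIMARY KEY (publication, uuid, language, area)
  );

Queries for a single area then just add "WHERE area = 'A'" etc.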
3) MetaData: one table per MetaData set (dcterms, dcelements, ...), one
column per field. For each MetaData set there is likewise one table for
the Live documents and one for the Authoring documents
Using separate tables for the metadata sets means that the DB schema
has to be altered to add new metadata sets. OTOH, using generic tables
makes it a bit harder to create efficient queries, and query building
tends to become more cumbersome.
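For comparison, a generic (entity-attribute-value style) metadata table
could look roughly like this (names invented):

  CREATE TABLE metadata (
      uuid     VARCHAR(64) NOT NULL,
      language VARCHAR(8)  NOT NULL,
      area     CHAR(1)     NOT NULL,
      metaset  VARCHAR(32) NOT NULL, -- e.g. 'dcterms', 'dcelements'
      field    VARCHAR(64) NOT NULL, -- e.g. 'title', 'categories'
      value    TEXT,
      PRIMARY KEY (uuid, language, area, metaset, field)
  );

  -- the category query from above becomes more verbose:
  SELECT uuid
    FROM metadata
   WHERE metaset = 'dcelements' AND field = 'categories'
     AND value LIKE '%c17,%';

(Getting the "latest 5" would additionally require a join against the
documents table for last_updated, which illustrates the "more
cumbersome" part.)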
4) Revision handling: based on our experience over some years now,
revisions are only used in the authoring area. So the idea is that the
above-mentioned tables store only the current revisions. To store the
older revisions of the documents and metadata, there is an additional
table for each of the tables mentioned above.
That would mean that data has to be moved whenever a new revision is
added. If we keep all revisions in a single table, we just have to
create a new record for each revision.
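Assuming the documents table gets an additional revision column in its
primary key, each save just inserts a new row, and the current revision
can be fetched like this (sketch only):

  SELECT content_text
    FROM documents
   WHERE uuid = ? AND language = ? AND area = 'A'
     AND revision = (SELECT MAX(revision)
                       FROM documents
                      WHERE uuid = ? AND language = ? AND area = 'A');

No rows ever have to be moved; old revisions simply stay where they are.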
Example table structure:
documents_authoring:
-* uuid
-* language
-* area [A(uthoring)|T(rash)|(archiv)E]
- workflow state
- last updated
- doctype (resource type)
- mimetype
- content_text (text field for the document content if it's a text type)
- content_blob (for binary content; pdf, images ...)
* = Primary key
metadata_authoring:
-* uuid
-* language
-* area [A(uthoring)|T(rash)|(archiv)E]
-* metasetkey (to identify the metaset)
(- workflow state)?
- last updated
- metafield 1
:
- metafield n
We're looking forward to your valuable feedback and wish you all a
Merry Christmas and a joyful and happy New Year!
Thanks, the same to you :)
Best regards,
Andreas
--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01