Hi Gerd,

I'm +1 on your proposal, but I also think that a JCR-based content repository could be more suitable and offer more features (for example, it's easier to build a CMIS endpoint on a JCR backend than on other storage technologies).

One of the key points I see in this "another repository implementation" initiative is the opportunity to really test the repository API and the independence of the repository layer from the rest of the system.

Beyond the database schema, I think an important piece of work will be to refactor and improve parts of the repository layer to make it truly independent and easy to extend and implement. This is an important task: with a "clean" repository layer we can then more easily implement other repository technologies (for example a Git-based one, JCR, NoSQL, ...).

Related to your SQL proposal, I have two thoughts:

1) Do you have in mind the use of an ORM framework (like http://cayenne.apache.org/) or not? Writing SQL queries by hand can sometimes be a pain, and an ORM can be a real help.

2) What about "user-defined metadata"? Lenya offers the ability to add specific metadata fields for specific document types. If I understand your DB schema correctly, this means adding columns to the metadata table, which doesn't seem very clean or easy to manage...
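For what it's worth, one common way to handle user-defined metadata without altering the table for every new field is a generic key/value (entity-attribute-value) table. A minimal sketch, using SQLite only for illustration; the table and column names here are hypothetical, not part of the proposal:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical key/value table: one row per (document, field) pair,
# so adding a new user-defined field needs no schema change.
cur.execute("""
    CREATE TABLE custom_metadata (
        uuid      TEXT NOT NULL,
        language  TEXT NOT NULL,
        field     TEXT NOT NULL,   -- the user-defined field name
        value     TEXT,
        PRIMARY KEY (uuid, language, field)
    )
""")

# A document type "press-release" might define an extra "embargo" field.
cur.execute("INSERT INTO custom_metadata VALUES (?, ?, ?, ?)",
            ("1234-abcd", "en", "embargo", "2012-01-15"))

# Querying a user-defined field is still a plain SELECT.
cur.execute("SELECT value FROM custom_metadata WHERE uuid = ? AND field = ?",
            ("1234-abcd", "embargo"))
print(cur.fetchone()[0])  # 2012-01-15
```

The trade-off: values lose their SQL types and queries combining many fields need more joins, but the schema stays fixed no matter how many fields a document type defines.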

Keep us posted on this project, it's a great initiative!

++


On 12/22/2011 05:58 PM, Gerd Schrick wrote:
Dear Lenya devs,

Since our last Lenya meeting in Freiburg I've been seriously thinking about a
SQL-database-based implementation of the content repository (DB content
repo) and would like to kindly ask you for your opinions, ideas, hints...

It will only be for Lenya 2 and is still in an early draft status.

Any honest feedback is highly appreciated.


Basic:
1) available as an alternative to the current file based repo
2) It should be possible to choose the repository for each publication
(no switching between repositories afterwards)

The idea is basically to "replace" the filesystem-based storage and put
all the content into a DB.
The overall structure of how content is managed/stored in Lenya should
not change (only if really necessary).

What we hope to achieve is:
1) better performance with large publications (> 10.000 documents)
2) an easy, fast and flexible way to do queries based on metadata
Example: there's a metadata field "categories" that stores the category
keys comma-separated, like "c2,c17,c2006,c33", and we want a link list on
the homepage that lists the latest 5 documents of a certain category
("c17"). The documents can be located anywhere in the publication.
With SQL this can easily be done with something like "SELECT uuid FROM
metatable WHERE ',' || categories || ',' LIKE '%,c17,%' ORDER BY
last_updated DESC LIMIT 5" (wrapping the value in commas so a key like
"c170" doesn't match, and ordering by date to get the latest documents).
Sure, such a case can be solved with Lucene as well, but I think there's
much more flexibility in doing something like this on the fly (maybe by
an author while configuring a ContentUnit)
3) Deactivating and deleting documents take very long in our large
publications due to the link checking (as far as I understand what's
going on).
I assume a DB will be much faster at finding all the items WHERE content
LIKE '%lenya-document:112344...%'
4) One problem when moving to our cluster environment was the
performance of the shared filesystem (NFS) between the delivery nodes
(Lenya instances); sharing the same DB won't be an issue.
4a) a better fit with modern enterprise environments:
nowadays there is usually no dedicated filesystem space provided for
applications. Instead, each application gets its share of a NAS, which
cannot cope with Lenya's heavy filesystem requirements.
5) Maybe clustered authoring will then be easier to achieve as well?
6) Better scalability
7) easier deployment
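One detail worth noting for point 2: with comma-separated keys, a plain LIKE '%c17,%' misses documents where c17 is the last key, and "latest 5" needs an ORDER BY; wrapping the stored string in commas on both sides keeps only whole-key matches. A small self-contained sketch (SQLite purely for illustration, with made-up table and data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE metatable (uuid TEXT, categories TEXT, last_updated TEXT)")
rows = [
    ("u1", "c2,c17,c2006,c33", "2011-12-01"),
    ("u2", "c170,c33",         "2011-12-10"),  # c170 must NOT match c17
    ("u3", "c33,c17",          "2011-12-20"),  # c17 as last key must match
    ("u4", "c17",              "2011-11-05"),
]
cur.executemany("INSERT INTO metatable VALUES (?, ?, ?)", rows)

# Wrap both sides in commas so only whole keys match, then take the newest 5.
cur.execute("""
    SELECT uuid FROM metatable
    WHERE ',' || categories || ',' LIKE '%,c17,%'
    ORDER BY last_updated DESC
    LIMIT 5
""")
print([r[0] for r in cur.fetchall()])  # ['u3', 'u1', 'u4']
```

(A normalized document-to-category join table would make this both cleaner and indexable, but even the comma-separated variant works with the wrapping trick.)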

The Database structure:
("documents" in this context is mainly what is currently stored in the
uuid folder with filename {language})
1) one database per publication
2) Documents tables: one table for the documents in the authoring area
(which also contains Trash and Archive) and one table for the documents
in the live area
3) Metadata: one table per metadata set (dcterms, dcelements, ...), one
column per field. Per metadata set there is likewise one table relating
to the live documents and one to the authoring documents.
4) Revision handling: based on our experience over some years now,
revisions are only used in the authoring area, so the idea is that the
tables mentioned above store only the current revisions. The older
revisions of the documents and metadata go into one additional table for
each of the tables mentioned above.

Example table structure:
documents_authoring:
-* uuid
-* language
-* area [A(uthoring)|T(rash)|(archiv)E]
- workflow state
- last updated
- doctype (resource type)
- mimetype
- content_text (text field for the document content if it's a text type)
- content_blob (for binary content: PDF, images, ...)

* = Primary key

metadata_authoring:
-* uuid
-* language
-* area [A(uthoring)|T(rash)|(archiv)E]
-* metasetkey (to identify the metaset)
(- workflow state)?
- last updated
- metafield 1
:
- metafield n
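For illustration, the documents table and its revisions companion could be sketched in DDL roughly like this (SQLite-flavoured and run from Python only so the sketch is executable; the trigger is just one possible way to implement the revision copying from point 4, not a final design):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents_authoring (
        uuid         TEXT NOT NULL,
        language     TEXT NOT NULL,
        area         TEXT NOT NULL CHECK (area IN ('A', 'T', 'E')),
        workflow     TEXT,
        last_updated TEXT,
        doctype      TEXT,
        mimetype     TEXT,
        content_text TEXT,
        content_blob BLOB,
        PRIMARY KEY (uuid, language, area)
    );

    -- Older revisions live in a separate table (point 4 above):
    -- same columns plus a revision number.
    CREATE TABLE documents_authoring_revisions (
        uuid TEXT, language TEXT, area TEXT, revision INTEGER,
        workflow TEXT, last_updated TEXT, doctype TEXT, mimetype TEXT,
        content_text TEXT, content_blob BLOB,
        PRIMARY KEY (uuid, language, area, revision)
    );

    -- One way to fill it: copy the old row whenever the current one changes.
    CREATE TRIGGER archive_revision BEFORE UPDATE ON documents_authoring
    BEGIN
        INSERT INTO documents_authoring_revisions
        SELECT old.uuid, old.language, old.area,
               1 + COALESCE((SELECT MAX(revision)
                             FROM documents_authoring_revisions r
                             WHERE r.uuid = old.uuid
                               AND r.language = old.language
                               AND r.area = old.area), 0),
               old.workflow, old.last_updated, old.doctype, old.mimetype,
               old.content_text, old.content_blob;
    END;
""")

conn.execute("INSERT INTO documents_authoring VALUES "
             "('u1','en','A','draft','2011-12-01','xhtml','text/xml','v1',NULL)")
conn.execute("UPDATE documents_authoring SET content_text='v2', "
             "last_updated='2011-12-02' WHERE uuid='u1'")
row = conn.execute("SELECT revision, content_text "
                   "FROM documents_authoring_revisions").fetchone()
print(row)  # (1, 'v1')
```

The metadata_authoring tables would follow the same pattern, with metasetkey added to the primary key.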


We're looking forward to your valuable feedback and wish you all a
Merry Christmas and a joyful and happy New Year!

Best regards,
Gerd and Hans


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lenya.apache.org
For additional commands, e-mail: dev-h...@lenya.apache.org
