Hi Gerd,
I'm +1 on your proposal, but I also think that a JCR-based content
repository could be more suitable and offer more value feature-wise (for
example, it's easier to build a CMIS endpoint on a JCR backend than on
other backends).
One of the key points I see in this "another repository implementation"
initiative is the opportunity to really test the repository API and the
independence of the repository layer from the rest of the system.
More than the database schema, I think an important part of the work will
be to refactor / improve some parts of the repository layer to make it
really independent and easy to extend / implement.
This is an important task, because with such a "clean" repository layer we
can then more easily implement other repository technologies (for example
a Git-based one, a JCR one, a NoSQL one, ...).
Regarding your SQL proposal, I have two thoughts:
1) Do you have in mind the use of an ORM framework (like
http://cayenne.apache.org/) or not? Hand-writing SQL queries can
sometimes be a pain, and an ORM can be a handy hammer.
2) What about "user-defined metadata"? Lenya offers the ability to add
specific metadata fields for specific document types. If I understand
your DB schema correctly, this means adding columns to the metadata
table, which doesn't seem very clean or easy to manage...
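For what it's worth, one common way around per-doctype columns is a
generic key/value table, so new document types need no schema change.
Below is a minimal sketch in Python/SQLite, purely illustrative: the
table name custom_metadata and its columns are my invention, not Lenya
code, and the real proposal might look quite different.

```python
import sqlite3

# Hypothetical alternative to "one column per user-defined field":
# each user-defined metadata value becomes one row.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE custom_metadata (
        uuid     TEXT NOT NULL,
        language TEXT NOT NULL,
        doctype  TEXT NOT NULL,   -- resource type the field belongs to
        field    TEXT NOT NULL,   -- user-defined field name
        value    TEXT,
        PRIMARY KEY (uuid, language, field)
    )
""")
conn.execute(
    "INSERT INTO custom_metadata VALUES (?, ?, ?, ?, ?)",
    ("1123-44", "en", "pressrelease", "embargo-date", "2012-01-15"))

# All custom fields of one document, no matter which doctype defined them:
rows = conn.execute(
    "SELECT field, value FROM custom_metadata WHERE uuid = ? AND language = ?",
    ("1123-44", "en")).fetchall()
print(rows)  # [('embargo-date', '2012-01-15')]
```

The trade-off is that querying such a table is more awkward than
querying dedicated columns, so it is not obviously better, just easier
to evolve.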
Keep us posted on this project, it sounds great!
++
On 12/22/2011 05:58 PM, Gerd Schrick wrote:
Dear Lenya devs,
Since our last Lenya meeting in Freiburg I have been seriously thinking
about an SQL-database-based implementation of the content repository
(DB-content-repo) and would like to kindly ask you for your opinions,
ideas, hints...
It will only be for Lenya 2 and is still at an early draft stage.
Any honest feedback is highly appreciated.
Basic:
1) available as an alternative to the current file based repo
2) It should be possible to choose the repository for each publication
(no switching between repositories, afterwards)
The idea is basically to "replace" the filesystem-based storage and put
all the content into a DB.
The overall structure of how content is managed/stored in Lenya should
not be changed (only if really necessary).
What we hope to achieve is:
1) better performance with large publications (> 10.000 documents)
2) easy, fast and flexible way to do queries based on metadata
example: there's a metafield "categories" that stores the category keys
comma-separated like "c2,c17,c2006,c33" and we want to have a linklist on
the homepage that lists the latest 5 documents of a certain ("c17")
category. The documents can be found anywhere in the publication.
With SQL this can easily be done with something like "SELECT uuid FROM
metatable WHERE category LIKE '%c17,%' LIMIT 5"
Sure, such a case can be solved with Lucene as well, but I think there's
much more flexibility in doing something like this on the fly (maybe by
an author while configuring a ContentUnit).
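To make the example concrete, here is a small sketch of that category
query in Python/SQLite. It is illustrative only: I'm assuming a
last_updated column for the "latest 5" ordering, and I store the list
with leading and trailing commas so the LIKE pattern also matches a
category at the very end of the comma-separated list (the bare pattern
'%c17,%' would miss a document whose list ends in "c17").

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metatable (
        uuid         TEXT PRIMARY KEY,
        category     TEXT,   -- stored as ",c2,c17,c2006,c33," (note the
                             -- leading/trailing commas, an assumption)
        last_updated TEXT    -- ISO date, used to order "latest" first
    )
""")
docs = [("u1", ",c2,c17,",  "2011-12-01"),
        ("u2", ",c17,c33,", "2011-12-20"),
        ("u3", ",c2,c33,",  "2011-12-10"),
        ("u4", ",c17,",     "2011-11-05")]
conn.executemany("INSERT INTO metatable VALUES (?, ?, ?)", docs)

# Latest 5 documents in category c17, anywhere in the publication:
rows = conn.execute("""
    SELECT uuid FROM metatable
    WHERE category LIKE '%,c17,%'
    ORDER BY last_updated DESC
    LIMIT 5
""").fetchall()
print([r[0] for r in rows])  # ['u2', 'u1', 'u4']
```

A normalized category table (one row per document/category pair) would
make the same query index-friendly, but even the LIKE version shows how
little code the on-the-fly use case needs.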
3) Deactivating and deleting documents takes very long in our large
publications due to the link checking (as far as I understand what's
going on)
- I assume a DB will be much faster at finding all the items WHERE
content LIKE '%lenya-document:112344...%'
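A quick sketch of that link lookup in Python/SQLite, with column names
borrowed from the documents_authoring table proposed below (the sample
UUIDs and content are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents_authoring (uuid TEXT, content_text TEXT)")
conn.executemany("INSERT INTO documents_authoring VALUES (?, ?)", [
    ("a", '<a href="lenya-document:112344">see also</a>'),
    ("b", "no links here"),
    ("c", "links to lenya-document:112344 and lenya-document:99"),
])

# Find every document whose content references the document being
# deleted/deactivated:
linking = conn.execute(
    "SELECT uuid FROM documents_authoring WHERE content_text LIKE ?"
    " ORDER BY uuid",
    ("%lenya-document:112344%",)).fetchall()
print([r[0] for r in linking])  # ['a', 'c']
```

One caveat: a LIKE with a leading wildcard still forces a full table
scan, so for very large publications a full-text index (or a dedicated
link table filled at save time) might be needed to get the speedup.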
4) One problem when moving to our cluster environment was the
performance of the shared filesystem (NFS) between the delivery nodes
(Lenya instances) - sharing the same DB won't be an issue
4a) Fits better with modern enterprise environments:
Nowadays there is usually no dedicated filesystem space provided for
applications. Instead, each application gets its share of a NAS, which
cannot cope with Lenya's heavy filesystem requirements.
5) Maybe clustered authoring will then also become easier to achieve?
6) Better scalability
7) easier deployment
The Database structure:
"documents" in this context is mainly what is curently stored in the
uuid folder with filename {language}
1) one database per publication
2) Documents table: one table for the documents in the Authoring (also
contains Trash and Archive) and one table for the documents in the Live
area
3) MetaData: one table per MetaData set (dcterms, dcelements, ...), one
column per field. Per MetaData set there is likewise one table relating
to the Live documents and one to the Authoring documents
4) Revision handling: based on our experience over some years now,
revisions are only used in the authoring area, so the idea is that the
tables mentioned above store only the current revisions. To store the
older revisions of the documents and metadata, there is one additional
table for each of the tables mentioned above.
Example table structure:
documents_authoring:
-* uuid
-* language
-* area [A(uthoring)|T(rash)|(archiv)E]
- workflow state
- last updated
- doctype (resource type)
- mimetype
- content_text (Textfield for the document content if it's a text type)
- content_blob (for binary content; pdf, images ...)
* = Primary key
metadata_authoring:
-* uuid
-* language
-* area [A(uthoring)|T(rash)|(archiv)E]
-* metasetkey (to identify the metaset)
(- workflow state)?
- last updated
- metafield 1
:
- metafield n
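The two tables above could be sketched as DDL roughly like this (shown
via Python/SQLite so it is runnable; the SQL types, constraints, and the
single metafield_1 column are my guesses, not part of the proposal):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents_authoring (
        uuid           TEXT NOT NULL,
        language       TEXT NOT NULL,
        area           TEXT NOT NULL CHECK (area IN ('A','T','E')),
        workflow_state TEXT,
        last_updated   TEXT,
        doctype        TEXT,   -- resource type
        mimetype       TEXT,
        content_text   TEXT,   -- document content if it's a text type
        content_blob   BLOB,   -- binary content: pdf, images, ...
        PRIMARY KEY (uuid, language, area)
    );

    CREATE TABLE metadata_authoring (
        uuid         TEXT NOT NULL,
        language     TEXT NOT NULL,
        area         TEXT NOT NULL CHECK (area IN ('A','T','E')),
        metasetkey   TEXT NOT NULL,  -- identifies the metaset
        last_updated TEXT,
        -- metafield_1 ... metafield_n, one column per field of the set;
        -- only one is shown here:
        metafield_1  TEXT,
        PRIMARY KEY (uuid, language, area, metasetkey)
    );
""")
cols = [row[1] for row in conn.execute("PRAGMA table_info(documents_authoring)")]
print(cols)
```

The matching _live tables would look the same minus the area column (or
with area fixed to Live), and the revision tables would add a revision
number to each primary key.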
We're looking forward to your valuable feedback and wish you all a
Merry Christmas and a joyful and happy New Year!
Best regards,
Gerd and Hans
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lenya.apache.org
For additional commands, e-mail: dev-h...@lenya.apache.org