Hi Gerd,
thanks for bringing this up again!
On 22.12.11 17:58, Gerd Schrick wrote:
Since our last Lenya-meeting in Freiburg I'm seriously thinking about a
SQL-database based implementation of the content repository (DB-content-
repo) and would like to kindly ask you for your opinion, ideas, hints...
It will only be for Lenya 2 and is still at an early draft stage.
Any honest feedback is highly appreciated.
Basic:
1) available as an alternative to the current file based repo
I think this is a very important point. IMO a straightforward migration
path is crucial to the success of the undertaking. If we can keep the
repository API untouched, i.e. the users don't have to make changes to
their client code, the chances will be much higher that the DB repo
implementation will be accepted.
2) It should be possible to choose the repository for each publication
(no switching between repositories, afterwards)
If the content structure can be clearly defined, it should be reasonably
easy to implement an import/export mechanism for either implementation,
which can be used to migrate content between the implementations. The
migration tool from Lenya 1.2 to Lenya 2.0 could be used as a basis
(traverse a content repository somewhere on the disk, write it into the
Lenya content repo using the repo API).
The idea is basically to "replace" the filesystem-based storage and put
all the content into a DB.
The overall structure of how content is managed/stored in Lenya should
not be changed (unless really necessary).
What we hope to achieve is:
1) better performance with large publications (> 10,000 documents)
Yes, although to improve performance it would probably already help
a lot to add a couple of levels to our repository tree, analogous to the
HTTPD disk cache (e.g. grouping documents that share the first n
characters of the UUID in a common directory).
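For illustration, with n = 2 the layout might look like this (UUIDs and
directory names made up):

  content/
    a1/
      a1b2c3d4-.../
      a13e57f9-.../
    b7/
      b74a9c21-.../

Each subdirectory then only holds a fraction of the documents, which
keeps directory listings fast.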
2) easy, fast and flexible way to do queries based on metadata
example: there's a metafield "categories" that stores the category keys
comma-separated, like "c2,c17,c2006,c33", and we want a link list on the
homepage that lists the latest 5 documents of a certain category
("c17"). The documents can be found anywhere in the publication.
With SQL this can easily be done with something like "SELECT uuid FROM
metatable WHERE categories LIKE '%c17,%' ORDER BY lastupdated DESC LIMIT 5"
Sure, such a case can be solved with Lucene as well, but I think that
SQL gives much more flexibility to do something like this on the fly
(maybe for an author while configuring a ContentUnit)
Lucene is also very flexible in this regard, and fairly speedy. You can
check out the tagcloud module for an example.
3) Deactivating and deleting documents takes a very long time in our
large publications due to the link checking (as far as I understand
what's going on)
I have code in my sandbox which uses Lucene for this purpose,
eliminating the performance penalty. See also
http://thread.gmane.org/gmane.comp.cms.lenya.devel/24238
The code needs some polishing, but I can commit it somewhere if there is
sufficient interest.
- I assume a DB will be much faster at finding all the items WHERE
content LIKE '%lenya-document:112344...%'
This will probably still be fairly slow, since a sequential scan over
quite a large amount of text is necessary. The approach described above
uses metadata to store outbound references; a sketch follows below. It
would work with both Lucene and an RDBMS.
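Just to make the idea concrete, an outbound-reference table in SQL terms
could look roughly like this (table and column names are invented):

  -- one row per outbound link, extracted when a document is saved
  CREATE TABLE doc_references (
      source_uuid VARCHAR(64) NOT NULL,
      language    VARCHAR(8)  NOT NULL,
      target_uuid VARCHAR(64) NOT NULL
  );
  CREATE INDEX idx_refs_target ON doc_references (target_uuid);

  -- before deactivating/deleting, find all documents linking to the target
  SELECT source_uuid, language
    FROM doc_references
   WHERE target_uuid = ?;

With an index on target_uuid this lookup stays fast regardless of the
total content size.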
4) One problem when moving to our cluster environment was the
performance of the shared filesystem (NFS) between the delivery nodes
(Lenya instances) - sharing the same DB won't be an issue
This is a good point.
4 a) fits better into modern enterprise environments
Nowadays, there is usually no dedicated filesystem space provided for
applications. Instead, each application gets its share from a NAS, which
cannot cope with Lenya's heavy filesystem load.
This makes sense.
5) Maybe clustered authoring will also become easier then?
With a shared DB it is probably easier to implement lock flags, along
the lines of the sketch below.
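A minimal sketch, assuming the documents table gets (hypothetical)
locked_by and locked_at columns:

  -- atomically acquire a lock; if 0 rows are affected, someone else holds it
  UPDATE documents
     SET locked_by = 'alice', locked_at = CURRENT_TIMESTAMP
   WHERE uuid = ? AND language = ? AND locked_by IS NULL;

  -- release the lock again
  UPDATE documents
     SET locked_by = NULL, locked_at = NULL
   WHERE uuid = ? AND language = ? AND locked_by = 'alice';

Since a single UPDATE is atomic, no separate locking infrastructure on
the filesystem would be needed.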
6) Better scalability
7) easier deployment
8) We could use the ACID functionality provided by the RDBMS instead of
our own transaction system.
The Database structure:
"documents" in this context is mainly what is curently stored in the
uuid folder with filename {language}
1) one database per publication
Why would you make it mandatory to use separate DBs for the
publications? I think this could be up to the user. Adding new
publications might be easier on the fly if all pubs share a common DB.
2) Documents table: one table for the documents in the Authoring (also
contains Trash and Archive) and one table for the documents in the Live
area
IMO one table with an area column would be sufficient.
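A minimal sketch of what I mean, with a publication column thrown in so
that several pubs could share one DB as mentioned above (column types
are assumptions):

  CREATE TABLE documents (
      publication    VARCHAR(64) NOT NULL,
      uuid           VARCHAR(64) NOT NULL,
      language       VARCHAR(8)  NOT NULL,
      area           CHAR(1)     NOT NULL, -- A(uthoring), T(rash), E (Archive), L(ive)
      workflow_state VARCHAR(32),
      last_updated   TIMESTAMP,
      doctype        VARCHAR(64),
      mimetype       VARCHAR(64),
      content_text   TEXT,
      content_blob   BLOB,
      PRIMARY KEY (publication, uuid, language, area)
  );

Queries for a single area then just add "WHERE area = 'A'" etc.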
3) MetaData: one table per MetaData set (dcterms, dcelements, ...), one
column per field. For each MetaData set there is likewise one table for
the Live documents and one for the Authoring documents
Using separate tables for the metadata sets means that the DB schema
has to be altered to add new metadata sets. OTOH, using generic tables
makes it a bit harder to create efficient queries, and query building
tends to become more cumbersome.
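For comparison, a generic (entity-attribute-value style) metadata table
could look roughly like this (names invented):

  CREATE TABLE metadata (
      uuid     VARCHAR(64) NOT NULL,
      language VARCHAR(8)  NOT NULL,
      area     CHAR(1)     NOT NULL,
      metaset  VARCHAR(32) NOT NULL, -- e.g. 'dcterms', 'dcelements'
      field    VARCHAR(64) NOT NULL, -- e.g. 'title', 'categories'
      value    TEXT,
      PRIMARY KEY (uuid, language, area, metaset, field)
  );

  -- the category query from above becomes more verbose:
  SELECT uuid
    FROM metadata
   WHERE metaset = 'dcelements' AND field = 'categories'
     AND value LIKE '%c17,%';

(Getting the "latest 5" would additionally require a join against the
documents table for last_updated, which illustrates the "more
cumbersome" part.)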
4) Revision handling: based on our experience over some years now,
revisions are only used in the authoring area. So the idea is that the
above-mentioned tables store only the current revisions. To store the
older revisions of the documents and metadata, there is an additional
table for each of the tables mentioned above.
That would mean that data has to be moved whenever a new revision is
added. If we keep all revisions in a single table, we just have to
create a new record for each revision.
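Assuming the documents table gets an additional revision column in its
primary key, each save just inserts a new row, and the current revision
can be fetched like this (sketch only):

  SELECT content_text
    FROM documents
   WHERE uuid = ? AND language = ? AND area = 'A'
     AND revision = (SELECT MAX(revision)
                       FROM documents
                      WHERE uuid = ? AND language = ? AND area = 'A');

No rows ever have to be moved; old revisions simply stay where they are.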
Example table structure:
documents_authoring:
-* uuid
-* language
-* area [A(uthoring)|T(rash)|(archiv)E]
- workflow state
- last updated
- doctype (resource type)
- mimetype
- content_text (text field for the document content if it's a text type)
- content_blob (for binary content; pdf, images ...)
* = Primary key
metadata_authoring:
-* uuid
-* language
-* area [A(uthoring)|T(rash)|(archiv)E]
-* metasetkey (to identify the metaset)
(- workflow state)?
- last updated
- metafield 1
:
- metafield n
We're looking forward to your valuable feedback and wish you all a
Merry Christmas and a joyful and happy New Year!
Thanks, the same to you :)
Best regards,
Andreas
--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01