Re: [Dspace-tech] Differences between the data model and the trunk

Mark Diggory Tue, 20 Nov 2007 04:43:43 -0800

<this somehow bounced from the tech list yesterday>

I throw up a flag of caution. I think there are concerns with  
unpredictable behavior of authorization/policies when such a behavior  
is allowed (many to many) and we see that in the the database schema  
where we have added a column to "Item" pertaining to its "owning  
collection".  We should be careful that the issues that arise around  
policy enforcement are not eroded by exposing the many to many  
relationships found in the model/database to the business logic  
layer.  There is a bit of duct tape over this model that keeps things  
from getting out of control which we need to be aware of and I'm not  
sure if it works as well for communities as it does for collections.


As well, we could really use someone looking at database  
"normalization" in areas like this.. I.E. should the Item table  
contain any reference to its owning collection, or should that be in  
another table? It happens again in Bundle with "primary bitstream  
id". Here the Bundle table references a bitstream outside of the  
"Bundle2Bitstream" table.  I think its great to talk about the model  
by looking that the database schema itself. So I've inlined some of  
the important structures below

It may be helpful to our 2.0 efforts to consider an above  
"normalization" effort across the whole database to assure that we  
are doing things in the best possible approach (this isn't science,  
this is just good database design).  For instance, one might consider  
that we need to normalize much of the "workflow" expressed in the  
Collection table by separating it out such that its in another table  
referring to the Collection table. In fact, we can apply this logic  
all over the database, After which we will begin to see that  
Community, Collection, Item, Bundle and Bitstream tables are really  
converging on the same schema structure (they should be, they are all  
DSOs) and that the structure would really only end up holding those  
attributes that are in common across those objects.  Now, I'm not  
saying its ideal to put them all into the same "table", separate  
tables are an optimization of container vs contained.  Many  
containers have very fewer members which are accessed at a higher  
frequency.  What I am saying is that there is a "template" for DSOs  
that we might work to make more formal for 2.0.  And that we will be  
expressing things like the "structure" of the relationships between  
DSO's separately from the attachment of "properties" or "metadata"  
and that the decisions about this "structure" should be part of the  
"core" dspace teams mandate.  Daniele, thanks for opening up this  
"can of worms" this is really what I hoped your efforts to apply  
defacto standard tools like Hibernate and Spring would begin to  
expose and get us talking about.  Welcome on board!

p.s. this discussion is moving to be very developer oriented, but I  
don't want to derail it by moving it to the dev list.  I think we  
might get some interesting non-developer input if we leave it here as  
well.

> -------------------------------------------------------
> -- Community table
> -------------------------------------------------------
> CREATE TABLE Community
> (
>   community_id      INTEGER PRIMARY KEY,
>   name              VARCHAR(128),
>   short_description VARCHAR(512),
>   introductory_text TEXT,
>   logo_bitstream_id INTEGER REFERENCES Bitstream(bitstream_id),
>   copyright_text    TEXT,
>   side_bar_text     TEXT
> );
> -------------------------------------------------------
> -- Collection table
> -------------------------------------------------------
> CREATE TABLE Collection
> (
>   collection_id     INTEGER PRIMARY KEY,
>   name              VARCHAR(128),
>   short_description VARCHAR(512),
>   introductory_text TEXT,
>   logo_bitstream_id INTEGER REFERENCES Bitstream(bitstream_id),
>   template_item_id  INTEGER REFERENCES Item(item_id),
>   provenance_description  TEXT,
>   license           TEXT,
>   copyright_text    TEXT,
>   side_bar_text     TEXT,
>   workflow_step_1   INTEGER REFERENCES EPersonGroup 
> ( eperson_group_id ),
>   workflow_step_2   INTEGER REFERENCES EPersonGroup 
> ( eperson_group_id ),
>   workflow_step_3   INTEGER REFERENCES EPersonGroup 
> ( eperson_group_id ),
>   submitter         INTEGER REFERENCES EPersonGroup 
> ( eperson_group_id ),
>   admin             INTEGER REFERENCES EPersonGroup( eperson_group_id)
> );
> -------------------------------------------------------
> -- Item table
> -------------------------------------------------------
> CREATE TABLE Item
> (
>   item_id         INTEGER PRIMARY KEY,
>   submitter_id    INTEGER REFERENCES EPerson(eperson_id),
>   in_archive      BOOL,
>   withdrawn       BOOL,
>   last_modified   TIMESTAMP WITH TIME ZONE,
>   owning_collection INTEGER
> );
> -------------------------------------------------------
> -- Bundle table
> -------------------------------------------------------
> CREATE TABLE Bundle
> (
>   bundle_id          INTEGER PRIMARY KEY,
>   name               VARCHAR(16),  -- ORIGINAL | THUMBNAIL | TEXT
>   primary_bitstream_id        INTEGER REFERENCES Bitstream(bitstream_id)
> );
> -------------------------------------------------------
> -- Bitstream table
> -------------------------------------------------------
> CREATE TABLE Bitstream
> (
>    bitstream_id            INTEGER PRIMARY KEY,
>    bitstream_format_id     INTEGER REFERENCES  
> BitstreamFormatRegistry(bitstream_format_id),
>    name                    VARCHAR(256),
>    size_bytes              BIGINT,
>    checksum                VARCHAR(64),
>    checksum_algorithm      VARCHAR(32),
>    description             TEXT,
>    user_format_description TEXT,
>    source                  VARCHAR(256),
>    internal_id             VARCHAR(256),
>    deleted                 BOOL,
>    store_number            INTEGER,
>    sequence_id             INTEGER
> );

> -------------------------------------------------------
> -- Community2Community table
> -------------------------------------------------------
> CREATE TABLE Community2Community
> (
>   id             INTEGER PRIMARY KEY,
>   parent_comm_id INTEGER REFERENCES Community(community_id),
>   child_comm_id  INTEGER REFERENCES Community(community_id)
> );
> -------------------------------------------------------
> -- Community2Collection table
> -------------------------------------------------------
> CREATE TABLE Community2Collection
> (
>   id             INTEGER PRIMARY KEY,
>   community_id   INTEGER REFERENCES Community(community_id),
>   collection_id  INTEGER REFERENCES Collection(collection_id)
> );
> -------------------------------------------------------
> -- Collection2Item table
> -------------------------------------------------------
> CREATE TABLE Collection2Item
> (
>   id            INTEGER PRIMARY KEY,
>   collection_id INTEGER REFERENCES Collection(collection_id),
>   item_id       INTEGER REFERENCES Item(item_id)
> );
> -------------------------------------------------------
> -- Item2Bundle table
> -------------------------------------------------------
> CREATE TABLE Item2Bundle
> (
>   id        INTEGER PRIMARY KEY,
>   item_id   INTEGER REFERENCES Item(item_id),
>   bundle_id INTEGER REFERENCES Bundle(bundle_id)
> );
> -------------------------------------------------------
> -- Bundle2Bitstream table
> -------------------------------------------------------
> CREATE TABLE Bundle2Bitstream
> (
>   id           INTEGER PRIMARY KEY,
>   bundle_id    INTEGER REFERENCES Bundle(bundle_id),
>   bitstream_id INTEGER REFERENCES Bitstream(bitstream_id)
> );


On Nov 19, 2007, at 8:34 AM, Andrea Bollini wrote:

> The mismatch is very old...
> I think that the API is the most right source of info, I assume that
> all of us are happy with many-to-many associations for communities  
> (also
> with themselves), collections & items. I'm not sure if we want keep a
> many-to-many for item-bundle-bitstream. In my opinion these are only
> one-to-many and we should change the API (and db) in this way. I know
> that the Jim work about versioning uses the many-to-many  
> association for
> item & bundle to share some of them between different versions, I hope
> that we can change this to share only the physical location of the
> bitstreams. In this way should be more simple to keep different  
> versions
> that change only metadata of a bitstream.
>
> Community * - * Community
> Community * - * Collection
> Collection * - * Item
> Item 1 - * Bundle
> Bundle 1 - * Bitstrea
>
> Best,
> Andrea
>
> daniele.ninfo ha scritto:
>> Hi all,
>>
>> I'm writing to underline some differences between the data model  
>> shown in the dspace site
>> ( http://www.dspace.org/index.php? 
>> option=com_content&task=view&id=149#data_model ) and the actual  
>> code present in the /trunk ( http://dspace-sandbox.googlecode.com/ 
>> svn/mirror/dspace/trunk/ ).
>>
>> There are differences in the associations between objects of the  
>> model: some one-to-many associations in the model are replaced in  
>> the code by many-to-many ones. I found 3 examples:
>>
>> - In the model, a Collection can be owned by only one Community,  
>> whereas in the code a Collection can be owned by many Community.
>> - In the model, a Bundle can be owned by only one Item, and in the  
>> code it can be owned by more then one Item
>> - In the model, a Bitstream can be owned by only one Bundle, and  
>> in the code by more then one.
>>
>> In the database schema, all the associations are many-to-many,  
>> there are no restrictions.
>> I also looked at Jim's DAO-prototype ( http://dspace- 
>> sandbox.googlecode.com/svn/branches/dao-prototype/ ), and he  
>> reflects the database schema, using only many-to-many associations.
>>
>> I'm working on a prototype to introduce Hibernate, and i'd base my  
>> work on Jim's choice, but i'd like to have some opinions about it,  
>> what should we do? update the data model associations? follow the  
>> model and update the code in /trunk? what kind of associations  
>> should be kept and what dropped?
>>
>> My mentor Andrea will write his opinions replying to this email,  
>> to let everyone know them. We both are interested in having other  
>> opinions to choose the best way to go ahead.
>>


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech

Re: [Dspace-tech] Differences between the data model and the trunk

Reply via email to