Re: [Fedora-commons-developers] Custom database module for Fedora

Peter Herndon Mon, 26 Oct 2009 14:59:49 -0700

Hi Asger,

I think we're mostly agreeing.

On Oct 26, 2009, at 11:50 AM, Asger Askov Blekinge wrote:

> Hi
>
> The gist of my reply was that rather than using some method to define
> which parts of an object should be cached in a database, we should
> instead devise a system storing the entire object in a database.

Yes, I completely agree: what I'd like to see is the entire object
available in a database.

> I believe xml databases to be superior for this, but we could probably
> do with a relational database.

I'm taking a pragmatic approach: yes, an xml database might be
superior, but the software available *right now* that works with
relational databases is both far more plentiful and already familiar
to most developers.  Thus having the object in a relational database
has the distinct advantage of being more immediately useful to a much
larger number of developers.

That may not be a design goal for the Fedora project, however -- the
software itself isn't meant as an all-purpose solution, and the
developer community is to some extent self-selecting.  Still, making
Fedora objects more accessible to more developers means a wider pool
of developers with relevant skill sets, which is good for libraries,
museums, et al., whose pool of candidates would be larger.

>
> On Mon, 2009-10-26 at 15:09 +0100, Peter Herndon wrote:
>> I wouldn't say this proposal was about speed.  I'd say this proposal
>> is about flexibility.
> Well, the original poster said "For speed reasons we wanted a database
> that contains the same information Fedora contains". But you are right,
> there could be other benefits of the proposal.
>

Indeed, I stand corrected.  I should rephrase: access speed is one
potential benefit, and flexibility is another.

>
>>
>> If you have all your Fedora data stored in a relational database,
>> rather than just bits and pieces, you can access it via any system
>> that also talks to an RDBMS.  Rather than having to write single-
>> purpose client code that talks to Fedora's REST or SOAP APIs, or
>> parses FOXML files directly, or talks to Mulgara, or to gSearch, or to
>> the parts that are stored in the RDBMS now, or more realistically,
>> some horrible combination of the above, you can simply talk directly
>> to the RDBMS and get any and all data you need.
> You forgot the FedoraClient and the Resource Index from that list...

:)  Indeed.

>
>
>>
>> Honestly, this would make Fedora almost instantly accessible via any
>> number of web frameworks (e.g. Django, Rails, etc.) that have an ORM,
>> vastly widening your potential developer base.  Instead of having to
>> learn some cumbersome client library that breaks every time the core
>> developers change their API (yes, that's a comment on the REST API's
>> lack of stability so far (and yes, I know it has been in beta, but
>> it's still been a point of pain)), a developer can just point their
>> framework of choice at the DB and away they go.  Big win all around
>> from my perspective.
>
> I am not convinced. Having a straight RDBMS is faster and simpler than
> Fedora. But Fedora does not try to be a less efficient RDBMS. It tries
> to be a digital object store (I think), and that imposes some extra
> demands.
>
> I have tried to formulate a description of what Fedora actually is,
> which conflicts somewhat with what you have:
>
> Fedora is a data store, working on data blobs (datastreams), and
> grouping such blobs in larger units (objects). The data blobs and the
> larger units have RDF relations, and a triple store can be queried about
> these. Access to the data can be restricted by policies. Everything is
> backed by simple files, as that should be easier to preserve.
>
> It would not be difficult to devise a database schema for a digital
> object, or even to provide a database backend instead of the foxml
> files. Of course if this database is accessed directly, rather than
> through any Fedora interfaces, the rest of the system might become out
> of sync. Is this what you are asking for, because then I would give it
> some thought?

Hmm.  It seems to me that the fundamental purpose of Fedora is not as  
clear as it once was.  That is, I have always taken Fedora's purpose  
to be two-fold:  to preserve a digital object and its associated  
metadata, while making that object available in a controlled manner.   
Preservation, and access control.

And yet, with the work currently underway on both the storage layer  
and the security layer, where both sets of functionality are being  
broken out into separate APIs and implementations, I think that those  
two main goals are being delegated to the implementation layers.

I'm not entirely sure where I'm going with this, as I do still believe  
that Fedora should maintain the preserve+access control purpose, and  
in my own work I've found that Fedora isn't really appropriate for my  
use cases.  Keeping one single point of definition for the data  
surrounding digital objects is a worthy goal, and definitely  
simplifies the whole architecture.  On the other hand, the flexibility  
advantages to be had from accessing the database directly could be  
profound.

As an example, in my own work I wrote a REST API client in Python,  
which I then use to inspect objects in Fedora.  I wrote a module for  
Django which uses my client library to request all the objects, read  
the FOXML, DC and MODS, and create Django ORM objects.  Each such  
Python object is backed via the Django ORM by a record in the RDBMS  
that Django normally uses for data persistence.  As it stands, the  
application that uses these modules is read-only, so I don't have to  
worry about synchronizing changes from the database back to Fedora.   
But my client library does have the ability to write back to Fedora,  
which I use during ingest.  My use case is presenting images, and I'm  
pulling extensive image metadata (EXIF, XMP) from the individual image  
files and storing it in the DC and MODS datastreams during the ingest  
process.  Users can then search on that metadata, as I'm using a  
Django module that integrates Solr search into Django.  I'm not using  
gSearch.
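
A minimal sketch of that kind of mirroring, for illustration only: it
uses the Python standard library (xml.etree and sqlite3) instead of
Django and a REST client, and it handles only the DC datastream.  The
sample FOXML, the "dc" table layout and the function names dc_fields()
and mirror() are all invented for the example, and it assumes that
datastream versions appear oldest-first in the file.

```python
import sqlite3
import xml.etree.ElementTree as ET

NS = {
    "foxml": "info:fedora/fedora-system:def/foxml#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up sample object, trimmed to the parts the sketch reads.
SAMPLE_FOXML = """<foxml:digitalObject PID="demo:1"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#">
  <foxml:datastream ID="DC">
    <foxml:datastreamVersion ID="DC.0">
      <foxml:xmlContent>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Harbour at dusk</dc:title>
          <dc:creator>A. Photographer</dc:creator>
          <dc:creator>B. Assistant</dc:creator>
        </oai_dc:dc>
      </foxml:xmlContent>
    </foxml:datastreamVersion>
  </foxml:datastream>
</foxml:digitalObject>"""


def dc_fields(foxml_text):
    """Return (pid, [(field, value), ...]) from the newest DC version."""
    root = ET.fromstring(foxml_text)
    pid = root.get("PID")
    fields = []
    for ds in root.findall("foxml:datastream", NS):
        if ds.get("ID") != "DC":
            continue
        versions = ds.findall("foxml:datastreamVersion", NS)
        if not versions:
            continue
        latest = versions[-1]  # assumption: versions are stored oldest-first
        dc_prefix = "{" + NS["dc"] + "}"
        for el in latest.iter():
            if el.tag.startswith(dc_prefix):
                fields.append((el.tag[len(dc_prefix):], (el.text or "").strip()))
    return pid, fields


def mirror(conn, foxml_text):
    """Replace the rows for one object with what its FOXML says now."""
    pid, fields = dc_fields(foxml_text)
    conn.execute("CREATE TABLE IF NOT EXISTS dc (pid TEXT, field TEXT, value TEXT)")
    conn.execute("DELETE FROM dc WHERE pid = ?", (pid,))
    conn.executemany("INSERT INTO dc VALUES (?, ?, ?)",
                     [(pid, field, value) for field, value in fields])
    conn.commit()
    return pid
```

Once the rows are in place, anything that speaks SQL can read them,
e.g. SELECT value FROM dc WHERE field = 'creator'.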

If all data for every object were stored in the RDBMS to start with, I  
wouldn't have to write as much code to make Fedora useful.  I'd still  
have to write *some* code to adapt the Fedora data to my framework of  
choice, but it would be significantly less code.

But then there's the problem of preservation.  By making the database  
the primary data persistence mechanism, I allow programming  
flexibility at the cost of preservation flexibility.  To maintain  
preservation, Fedora would need a tool that writes changes made in the  
DB to the plain-text file representation stored on the file system.   
That tool should be capable of maintaining file consistency and  
integrity even in the face of sudden loss of the DB.
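
As a rough sketch of the file-writing half of such a tool (an
assumption about how it could work, not something Fedora provides):
write each object to a temporary file, fsync it, and rename it over
the old copy, so that a crash leaves either the old file or the new
one but never a partial one.  The names export_object() and
verify_object(), the pid-to-filename rule and the use of SHA-256 are
all hypothetical.

```python
import hashlib
import os
import tempfile


def export_object(directory, pid, xml_text):
    """Write one object's XML to <directory>/<pid>.xml via temp file + rename.

    Returns (path, sha256 hex digest) so the caller can record the digest.
    """
    data = xml_text.encode("utf-8")
    # Hypothetical naming rule: "demo:1" becomes "demo_1.xml".
    final_path = os.path.join(directory, pid.replace(":", "_") + ".xml")
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        # Same directory, so the rename does not cross filesystems.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path, hashlib.sha256(data).hexdigest()


def verify_object(path, expected_sha256):
    """Return True if the file on disk still matches the recorded digest."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == expected_sha256
```

Because the files are complete after every export, they stay usable on
their own if the DB disappears; the recorded digests give a later
integrity check something to compare against.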

Keeping the metadata *only* in the RDBMS is no solution, from a  
preservation standpoint.  It is certainly possible, but not at all  
practical, to migrate the metadata indefinitely.

Keeping the metadata only in the file system presents the challenges I  
currently face.

Keeping the metadata in both the DB and the file system presents the  
challenge of synchronizing the two, for preservation.
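
One way to at least detect when the two copies have drifted apart is
to compare digests of each object's serialized form on both sides.  A
minimal sketch, assuming both stores can hand back the XML text for a
pid; find_drift() and its dictionary inputs are made up for
illustration:

```python
import hashlib


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_drift(db_objects, file_objects):
    """Compare two {pid: xml_text} mappings and report where they disagree."""
    db_pids = set(db_objects)
    file_pids = set(file_objects)
    return {
        "only_in_db": sorted(db_pids - file_pids),
        "only_on_disk": sorted(file_pids - db_pids),
        "differ": sorted(
            pid for pid in db_pids & file_pids
            if _digest(db_objects[pid]) != _digest(file_objects[pid])
        ),
    }
```

This only reports the mismatch; deciding which side wins is the hard
part and is left open here.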

Come to think of it, I have no idea how Fedora would implement access  
control for direct DB connections.  In my current cases, I do my  
access control in the UI layer.  My authentication is via LDAP, and my  
authorization is done using permissions within the Django web  
framework.  I set up Fedora so that all access is through the web  
interface, and I don't use XACML at all.

If FESL comes to fruition, and works as a stand-alone web service, I  
could see myself using it as a means of authorization.  I would  
absolutely love to have my ACLs stored with the digital object, and  
have my access control be rebuildable.  As of right now, though, I'm  
bypassing Fedora's access control entirely.

Again, I don't know if I have anything resembling a point, except to  
say that data in an RDBMS is more immediately useful to me than data  
behind an API.  But data behind an API is still more useful than no  
data at all, so there's that.

Regards,

---Peter

>
> Regards
>
>
>
>
>> Regards,
>>
>> ---Peter
>> On Oct 26, 2009, at 8:14 AM, Gert Schmeltz Pedersen wrote:
>>
>>> Have you tried to use Fedora GSearch? I do not think that either a
>>> relational database search or an xml database search performs better
>>> than GSearch with Lucene or Solr.
>>>
>>> Cheers,
>>> Gert
>>>
>>>> -----Original Message-----
>>>> From: Asger Askov Blekinge [mailto:a...@statsbiblioteket.dk]
>>>> Sent: Monday, October 26, 2009 12:47 PM
>>>> To: Lodewijk Bogaards
>>>> Cc: Fedora commons developers
>>>> Subject: Re: [Fedora-commons-developers] Custom database module for
>>>> Fedora
>>>>
>>>> Hi
>>>>
>>>> 10 days and no replies. That's not nice of people. So here I go.
>>>>
>>>> I think I can follow the design you propose, even though I am not
>>>> really into the database code part of Fedora.
>>>> To retell it, so you can check my understanding: There is some config
>>>> in DefaultDOManager.dbspec that determines which part of a fedora
>>>> object is cached in the database. You amend that config, so that the
>>>> user can provide a config file, so that additional content is cached.
>>>> That's all there is, right?
>>>>
>>>> I am not against the idea, but I consider it a stopgap measure.
>>>> The problem you outline is that actually querying the foxml files
>>>> is too slow in the fedora design. You want a faster way to access the
>>>> contents, and thus you propose to store it in a database. So far I
>>>> agree, the fedora backend is not fast for small queries (as the entire
>>>> object is parsed for any query), and some indexed frontend is
>>>> sometimes required.
>>>> Now, I do not know the performance of the various open source xml
>>>> databases, but it sounds radically simpler to store/backup the foxml
>>>> objects in an xml database, than writing complex expressions for
>>>> mapping selected parts to a relational database.
>>>>
>>>> Having such a database, which could either be a cache of the foxml
>>>> files, or the primary store for the foxml files would allow fast
>>>> queries about properties on the objects or datastreams. This should
>>>> probably be the design we work towards, but your idea could easily
>>>> serve as a current way of doing database integration while we have no
>>>> xml database.
>>>>
>>>> Regards
>>>>
>>>>
>>>> On Fri, 2009-10-16 at 20:32 +0200, Lodewijk Bogaards wrote:
>>>>> Hi,
>>>>>
>>>>> For speed reasons we wanted a database that contains the same
>>>>> information Fedora contains. I have emailed before (subject:
>>>>> gDatabase) that I figured that Fedora already has a feature to do so,
>>>>> for the dublin core and some other digital object properties, and
>>>>> that with some work Fedora can be made to keep the database
>>>>> synchronized for its user-made XML data as well.
>>>>> Currently I have this working within Fedora.
>>>>>
>>>>> I am sending you the source which was made on top of the Fedora
>>>>> 3.2.1 source release, an example foxml and database schema.
>>>>>
>>>>> The idea is that DefaultDOManager.dbspec is extended with this line:
>>>>>
>>>>>   <include href="server/config/custom-db.xml" />
>>>>>
>>>>> Then in that file under the Fedora home dir you can put your own
>>>>> database schema, which is an extension of the database schema used in
>>>>> the dbspec file.
>>>>>
>>>>> Columns get their data by value getters. Currently I have
>>>>> implemented one value getter that uses an xPath query to get a value.
>>>>> This value getting code does not necessarily run for all digital
>>>>> objects. It is possible to choose a content model and/or datastream
>>>>> id that must be present for the tables to be updated by the digital
>>>>> object. Here is an example of a table with a column:
>>>>>
>>>>> <table name="easyFiles"
>>>>>        contentModel="info:fedora/fedora-system:easyfile"
>>>>>        datastreamId="file">
>>>>>   <column name="filename" type="varchar(256)" notNull="true"
>>>>>           index="filename" default="-">
>>>>>     <value delimiterType="row" delimiter=",">
>>>>>       <valuegetter type="xPath" xPath="//easyfile:filename"
>>>>>                    nsPrefix="easyfile"
>>>>>                    nsUri="http://easy.dans.knaw.nl/files"
>>>>>                    delimiterType="normal" delimiter="," />
>>>>>     </value>
>>>>>   </column>
>>>>> </table>
>>>>>
>>>>> An xPath query may return several values. For that two kinds of
>>>>> delimiters may be used. A row delimiter (meaning several rows are
>>>>> created for each value) and a normal delimiter (meaning a string
>>>>> value is inserted after every row). Also a values tag may contain
>>>>> several valuegetter tags, which can be delimited in the same two ways.
>>>>> If two columns return two rows those two rows are added together as
>>>>> one row. Also a defaultvalue for a second valuegetter may be used.
>>>>> Thus creating the possibility of composing rows almost any way one
>>>>> wants based on Fedora data.
>>>>>
>>>>> A pid must always be present, but does not need to be the primary
>>>>> key (primaryKey attribute of the table). It is thus up to the user
>>>>> how the data is composed into tables, and if the user makes a mistake
>>>>> an SQLException is thrown and the digital object is thus not
>>>>> ingested/updated, thus forming another kind of safety net that does
>>>>> not necessarily work so well if the database would be filled from
>>>>> within the user's application.
>>>>>
>>>>> With this simple system it is possible to do almost any kind of
>>>>> database synchronization based on Fedora data. I have seen many
>>>>> projects based on Fedora that employ a database alongside Fedora in
>>>>> order to speed up the querying process. I therefore think this might
>>>>> be useful for many.
>>>>>
>>>>> Of course the search interface that comes with Fedora may also be
>>>>> extended to make use of this new feature, but since that is not a
>>>>> need for our project at the moment I have not taken the time to do so.
>>>>>
>>>>> I would be very pleased if this could become part of subsequent
>>>>> Fedora releases. Hopefully others think so too.
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Lodewijk Bogaards
>>>>>
>>>>>
>>>>
>>>>
>>>> -----------------------------------------------------------------------
>>>> -------
>>>> Come build with us! The BlackBerry(R) Developer Conference in SF,  
>>>> CA
>>>> is the only developer event you need to attend this year. Jumpstart
>>>> your
>>>> developing skills, take BlackBerry mobile applications to market  
>>>> and
>>>> stay
>>>> ahead of the curve. Join us from November 9 - 12, 2009. Register  
>>>> now!
>>>> http://p.sf.net/sfu/devconference
>>>> _______________________________________________
>>>> Fedora-commons-developers mailing list
>>>> Fedora-commons-developers@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-
>>>> developers
>>>
>>
>>
>

