Re: [Fedora-commons-developers] Custom database module for Fedora

Schwichtenberg, Frank Tue, 27 Oct 2009 03:47:57 -0700

Hello,
that is a very interesting discussion. Actually, the positions do not
seem to be so far apart.


I think anyone working with Fedora finds it obvious that something is
needed to get good performance for specific - or better: generic - use
cases. That seems to be the reason for approaches like the Resource
Index and GSearch. For a generic approach it is obviously an advantage
to have the entire object in an RDBMS. The crucial point is: which
object, and who persists it? The Fedora approach has its benefits, and
that seems to be the reason a lot of people actually use Fedora.
Thanks, Asger, for that good summary of what Fedora does/is. If the
Fedora object is to be fully persisted in an RDBMS, that should of
course be done by Fedora itself, and it should speed up Fedora.
(Someone may decide to read (!) from that database directly, but s/he
should have very good reasons.) The problem here is storing the blobs,
even if they are XML blobs. And because Fedora does not look into the
XML/blobs, how and why should it provide an interface for searching
them? GSearch is a useful extension for that purpose. In the same way
one can create an XML index or fill an XML database.

Another way - which I think is roughly what Peter describes - is to
let the system that uses Fedora store its own view of the objects.
Advantage: probably no blobs. Drawback: how to keep the RDBMS and
Fedora in sync. That could be done via post-persist hooks that use the
Fedora API. We would then have a system that operates on its own
database containing system-specific objects and stays in sync with the
FOXML managed by Fedora.
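
To make the post-persist idea a bit more concrete, here is a minimal
sketch in Python. Everything in it is an assumption for illustration:
the ObjectStore class, the table layout, and RecordingFedoraClient,
which merely stands in for a real Fedora API client and only records
the calls it receives. The point is just the ordering: the application
commits to its own database first, then the hooks pass the change on.

```python
import sqlite3


class RecordingFedoraClient:
    """Hypothetical stand-in for a Fedora API client; it only records calls."""

    def __init__(self):
        self.calls = []

    def modify_object(self, pid, fields):
        # A real client would issue an API request to Fedora here.
        self.calls.append((pid, dict(fields)))


class ObjectStore:
    """The application's own relational view of its objects (assumed schema)."""

    def __init__(self, conn, post_persist_hooks=()):
        self.conn = conn
        self.hooks = list(post_persist_hooks)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS objects "
            "(pid TEXT PRIMARY KEY, title TEXT)"
        )

    def save(self, pid, title):
        # Commit the local row first ...
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO objects (pid, title) VALUES (?, ?)",
                (pid, title),
            )
        # ... then run the post-persist hooks that forward the change.
        for hook in self.hooks:
            hook(pid, {"title": title})


fedora = RecordingFedoraClient()
store = ObjectStore(sqlite3.connect(":memory:"), [fedora.modify_object])
store.save("demo:1", "First title")
store.save("demo:1", "Revised title")
rows = store.conn.execute("SELECT pid, title FROM objects").fetchall()
```

Note that if a hook fails after the local commit, the two stores are
out of sync, so a real system would need some retry or reconciliation
step on top of this.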

So maybe I just want to say I like the "f" and the "a" in Fedora. ;-)
My impression is that one way is to speed up Fedora itself, but to do
it internally and keep the flexibility. Another way is to do it outside
of Fedora, in a system that uses Fedora.

In my opinion, the extension proposed by Lodewijk does not go that
far, but it is very useful for enhancing the interaction with Fedora.
The idea seems to be not to provide everything imaginable but just to
extend the possibilities for finding objects in Fedora. I just don't
like using the database directly, but Lodewijk addresses that too by
proposing an extension of Fedora's search interface. I would just pick
out a few characteristic values in order to select specific objects. I
would not use it for a full search over all (XML) datastreams of an
object.
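
As a rough illustration of "a few characteristic values", the sketch
below pulls one value out of an XML datastream with an XPath-style
query and stores one row per hit, keyed by pid, so that objects can be
selected by that value. It is written in Python, not against Fedora's
code: the sample XML, the helper names and the SQLite table are
assumptions; only the easyfile namespace, the filename element and the
table name are borrowed from Lodewijk's example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespace taken from Lodewijk's example config.
NS = {"easyfile": "http://easy.dans.knaw.nl/files"}

# Made-up datastream content, for illustration only.
SAMPLE = """<easyfile:files xmlns:easyfile="http://easy.dans.knaw.nl/files">
  <easyfile:filename>report.pdf</easyfile:filename>
  <easyfile:filename>data.csv</easyfile:filename>
</easyfile:files>"""


def extract_filenames(xml_text):
    """Return the text of every easyfile:filename element, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//easyfile:filename", NS)]


def index_object(conn, pid, xml_text):
    """Replace the rows for one object: one row per extracted value."""
    with conn:
        conn.execute("DELETE FROM easyFiles WHERE pid = ?", (pid,))
        conn.executemany(
            "INSERT INTO easyFiles (pid, filename) VALUES (?, ?)",
            [(pid, name) for name in extract_filenames(xml_text)],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE easyFiles (pid TEXT NOT NULL, filename TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_easyfiles_filename ON easyFiles (filename)")

index_object(conn, "demo:1", SAMPLE)
hits = conn.execute(
    "SELECT pid FROM easyFiles WHERE filename = ?", ("data.csv",)
).fetchall()
```

The lookup returns only pids; the object itself would still be fetched
through Fedora, which is the kind of use I have in mind.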

Regards, Frank

> -----Original Message-----
> From: Peter Herndon [mailto:tphern...@gmail.com]
> Sent: Monday, 26 October 2009 22:59
> To: Fedora commons developers
> Subject: Re: [Fedora-commons-developers] Custom database module for
> Fedora
> 
> Hi Asger,
> 
> I think we're mostly agreeing.
> 
> On Oct 26, 2009, at 11:50 AM, Asger Askov Blekinge wrote:
> 
> > Hi
> >
> > The gist of my reply was that rather than using some method to define
> > which parts of an object should be cached in a database, we should
> > instead devise a system storing the entire object in a database.
> 
> Yes, I completely agree, what I'd like to see is the entire object
> available in a database.
> 
> > I believe xml databases to be superior for this, but we could
> > probably do with a relational database.
> 
> I'm taking a pragmatic approach, in that yes, an xml database might be
> superior, but the software available *right now* that works with
> relational databases is both significantly larger in quantity and is
> already familiar to most developers.  Thus having the object in a
> relational database has the distinct advantage of being more
> immediately useful to a much larger number of developers.
> 
> That may not be a design goal for the Fedora project, however -- the
> software itself isn't meant as an all-purpose solution, and the
> developer community is to some extent self-selecting.  Still, making
> Fedora objects more accessible to more developers means a wider pool
> of developers that have relevant skill-sets.  Which is good for
> libraries, museums, et al. whose pool of candidates would be larger.
> 
> >
> > On Mon, 2009-10-26 at 15:09 +0100, Peter Herndon wrote:
> >> I wouldn't say this proposal was about speed.  I'd say this proposal
> >> is about flexibility.
> > Well, the original poster said "For speed reasons we wanted a
> > database that contains the same information Fedora contains". But
> > you are right, there could be other benefits of the proposal.
> >
> 
> Indeed, I stand corrected.  I should rephrase to say that access speed
> is indeed one potential benefit, and flexibility is another.
> 
> >
> >>
> >> If you have all your Fedora data stored in a relational database,
> >> rather than just bits and pieces, you can access it via any system
> >> that also talks to an RDBMS.  Rather than having to write
> >> single-purpose client code that talks to Fedora's REST or SOAP APIs,
> >> or parses FOXML files directly, or talks to Mulgara, or to gSearch,
> >> or to the parts that are stored in the RDBMS now, or more
> >> realistically, some horrible combination of the above, you can
> >> simply talk directly to the RDBMS and get any and all data you need.
> > You forgot the FedoraClient and the Resource Index from that list...
> 
> :)  Indeed.
> 
> >
> >
> >>
> >> Honestly, this would make Fedora almost instantly accessible via any
> >> number of web frameworks (e.g. Django, Rails, etc.) that have an ORM,
> >> vastly widening your potential developer base.  Instead of having to
> >> learn some cumbersome client library that breaks every time the core
> >> developers change their API (yes, that's a comment on the REST API's
> >> lack of stability so far (and yes, I know it has been in beta, but
> >> it's still been a point of pain)), a developer can just point their
> >> framework of choice at the DB and away they go.  Big win all around
> >> from my perspective.
> >
> > I am not convinced. Having a straight RDBMS is faster and simpler
> > than Fedora. But Fedora does not try to be a less efficient RDBMS.
> > It tries to be a digital object store (I think), and that imposes
> > some extra demands.
> >
> > I have tried to formulate a description of what Fedora actually is,
> > which conflicts somewhat with what you have:
> >
> > Fedora is a data store, working on data blobs (datastreams), and
> > grouping such blobs in larger units (objects). The data blobs and the
> > larger units have RDF relations, and a triple store can be queried
> > about these. Access to the data can be restricted by policies.
> > Everything is backed by simple files, as that should be easier to
> > preserve.
> >
> > It would not be difficult to devise a database schema for a digital
> > object, or even to provide a database backend instead of the foxml
> > files. Of course if this database is accessed directly, rather than
> > through any Fedora interfaces, the rest of the system might become
> > out of sync. Is this what you are asking for, because then I would
> > give it some thought?
> 
> Hmm.  It seems to me that the fundamental purpose of Fedora is not as
> clear as it once was.  That is, I have always taken Fedora's purpose
> to be two-fold:  to preserve a digital object and its associated
> metadata, while making that object available in a controlled manner.
> Preservation, and access control.
> 
> And yet, with the work currently underway on both the storage layer
> and the security layer, where both sets of functionality are being
> broken out into separate APIs and implementations, I think that those
> two main goals are being delegated to the implementation layers.
> 
> I'm not entirely sure where I'm going with this, as I do still believe
> that Fedora should maintain the preserve+access control purpose, and
> in my own work I've found that Fedora isn't really appropriate for my
> use cases.  Keeping one single point of definition for the data
> surrounding digital objects is a worthy goal, and definitely
> simplifies the whole architecture.  On the other hand, the flexibility
> advantages to be had from accessing the database directly could be
> profound.
> 
> As an example, in my own work I wrote a REST API client in Python,
> which I then use to inspect objects in Fedora.  I wrote a module for
> Django which uses my client library to request all the objects, read
> the FOXML, DC and MODS, and create Django ORM objects.  Each such
> Python object is backed via the Django ORM by a record in the RDBMS
> that Django normally uses for data persistence.  As it stands, the
> application that uses these modules is read-only, so I don't have to
> worry about synchronizing changes from the database back to Fedora.
> But my client library does have the ability to write back to Fedora,
> which I use during ingest.  My use case is presenting images, and I'm
> pulling extensive image metadata (EXIF, XMP) from the individual image
> files and storing it in the DC and MODS datastreams during the ingest
> process.  Users can then search on that metadata, as I'm using a
> Django module that integrates Solr search into Django.  I'm not using
> gSearch.
> 
> If all data for every object were stored in the RDBMS to start with, I
> wouldn't have to write as much code to make Fedora useful.  I'd still
> have to write *some* code to adapt the Fedora data to my framework of
> choice, but it would be significantly less code.
> 
> But then there's the problem of preservation.  By making the database
> the primary data persistence mechanism, I allow programming
> flexibility at the cost of preservation flexibility.  To maintain
> preservation, Fedora would need a tool that writes changes made in the
> DB to the plain-text file representation stored on the file system.
> That tool should be capable of maintaining file consistency and
> integrity even in the face of sudden loss of the DB.
> 
> Keeping the metadata *only* in the RDBMS is no solution, from a
> preservation standpoint.  It is certainly possible, but not at all
> practical, to migrate the metadata indefinitely.
> 
> Keeping the metadata only in the file system presents the challenges I
> currently face.
> 
> Keeping the metadata in both the DB and the file system presents the
> challenge of synchronizing the two, for preservation.
> 
> Come to think of it, I have no idea how Fedora would implement access
> control for direct DB connections.  In my current cases, I do my
> access control in the UI layer.  My authentication is via LDAP, and my
> authorization is done using permissions within the Django web
> framework.  I set up Fedora so that all access is through the web
> interface, and I don't use XACML at all.
> 
> If FESL comes to fruition, and works as a stand-alone web service, I
> could see myself using it as a means of authorization.  I would
> absolutely love to have my ACLs stored with the digital object, and
> have my access control be rebuildable.  As of right now, though, I'm
> bypassing Fedora's access control entirely.
> 
> Again, I don't know if I have anything resembling a point, except to
> say that data in an RDBMS is more immediately useful to me than data
> behind an API.  But data behind an API is still more useful than no
> data at all, so there's that.
> 
> Regards,
> 
> ---Peter
> 
> 
> >
> > Regards
> >
> >
> >
> >
> >> Regards,
> >>
> >> ---Peter
> >> On Oct 26, 2009, at 8:14 AM, Gert Schmeltz Pedersen wrote:
> >>
> >>> Have you tried to use Fedora GSearch? I do not think that either a
> >>> relational database search or an xml database search performs
> >>> better than GSearch with Lucene or Solr.
> >>>
> >>> Cheers,
> >>> Gert
> >>>
> >>>> -----Original Message-----
> >>>> From: Asger Askov Blekinge [mailto:a...@statsbiblioteket.dk]
> >>>> Sent: Monday, October 26, 2009 12:47 PM
> >>>> To: Lodewijk Bogaards
> >>>> Cc: Fedora commons developers
> >>>> Subject: Re: [Fedora-commons-developers] Custom database module
> >>>> for Fedora
> >>>>
> >>>> Hi
> >>>>
> >>>> 10 days and no replies. That's not nice of people. So here I go.
> >>>>
> >>>> I think I can follow the design you propose, even though I am not
> >>>> really into the database code part of Fedora.
> >>>> To retell it, so you can check my understanding: There is some
> >>>> config in DefaultDOManager.dbspec that determines which part of a
> >>>> fedora object is cached in the database. You amend that config, so
> >>>> that the user can provide a config file, so that additional
> >>>> content is cached.
> >>>> That's all there is, right?
> >>>>
> >>>> I am not against the idea, but I consider it a stopgap measure.
> >>>> The problem you outline is that actually querying the foxml files
> >>>> is too slow in the fedora design. You want a faster way to access
> >>>> the contents, and thus you propose to store it in a database. So
> >>>> far I agree, the fedora backend is not fast for small queries (as
> >>>> the entire object is parsed for any query), and some indexed
> >>>> frontend is sometimes required.
> >>>> Now, I do not know the performance of the various open source xml
> >>>> databases, but it sounds radically simpler to store/backup the
> >>>> foxml objects in an xml database, than writing complex expressions
> >>>> for mapping selected parts to a relational database.
> >>>>
> >>>> Having such a database, which could either be a cache of the
> >>>> foxml files, or the primary store for the foxml files, would allow
> >>>> fast queries about properties on the objects or datastreams. This
> >>>> should probably be the design we work towards, but your idea could
> >>>> easily serve as a current way of doing database integration while
> >>>> we have no xml database.
> >>>>
> >>>> Regards
> >>>>
> >>>>
> >>>> On Fri, 2009-10-16 at 20:32 +0200, Lodewijk Bogaards wrote:
> >>>>> Hi,
> >>>>>
> >>>>> For speed reasons we wanted a database that contains the same
> >>>>> information Fedora contains. I have emailed before (subject:
> >>>>> gDatabase) that I figured that Fedora already has a feature to do
> >>>>> so, for the dublin core and some other digital object properties,
> >>>>> and that with some work Fedora can be made to keep the database
> >>>>> synchronized for its user-made XML data as well.
> >>>>> Currently I have this working within Fedora.
> >>>>>
> >>>>> I am sending you the source which was made on top of the Fedora
> >>>>> 3.2.1 source release, an example foxml and database schema.
> >>>>>
> >>>>> The idea is that DefaultDOManager.dbspec is extended with this line:
> >>>>>
> >>>>>   <include href="server/config/custom-db.xml" />
> >>>>>
> >>>>> Then in that file under the Fedora home dir you can put your own
> >>>>> database schema, which is an extension of the database schema
> >>>>> used in the dbspec file.
> >>>>>
> >>>>> Columns get their data by value getters. Currently I have
> >>>>> implemented one value getter that uses an xPath query to get a
> >>>>> value. This value getting code does not necessarily run for all
> >>>>> digital objects. It is possible to choose a content model and/or
> >>>>> datastream id that must be present for the tables to be updated
> >>>>> by the digital object. Here is an example of a table with a
> >>>>> column:
> >>>>>
> >>>>> <table name="easyFiles"
> >>>>>        contentModel="info:fedora/fedora-system:easyfile"
> >>>>>        datastreamId="file">
> >>>>>   <column name="filename" type="varchar(256)" notNull="true"
> >>>>>           index="filename" default="-">
> >>>>>     <value delimiterType="row" delimiter=",">
> >>>>>       <valuegetter type="xPath" xPath="//easyfile:filename"
> >>>>>                    nsPrefix="easyfile"
> >>>>>                    nsUri="http://easy.dans.knaw.nl/files"
> >>>>>                    delimiterType="normal" delimiter="," />
> >>>>>     </value>
> >>>>>   </column>
> >>>>> </table>
> >>>>>
> >>>>> An xPath query may return several values. For that two kinds of
> >>>>> delimiters may be used. A row delimiter (meaning several rows are
> >>>>> created for each value) and a normal delimiter (meaning a string
> >>>>> value is inserted after every row). Also a values tag may contain
> >>>>> several valuegetter tags, which can be delimited in the same two
> >>>>> ways.
> >>>>> If two columns return two rows those two rows are added together
> >>>>> as one row. Also a defaultvalue for a second valuegetter may be
> >>>>> used. Thus creating the possibility of composing rows almost any
> >>>>> way one wants based on Fedora data.
> >>>>>
> >>>>> A pid must always be present, but does not need to be the primary
> >>>>> key (primaryKey attribute of the table). It is thus up to the
> >>>>> user how the data is composed into tables, and if the user makes
> >>>>> a mistake an SQLException is thrown and the digital object is
> >>>>> thus not ingested/updated, thus forming another kind of safety
> >>>>> net that does not necessarily work so well if the database would
> >>>>> be filled from within the user's application.
> >>>>>
> >>>>> With this simple system it is possible to do almost any kind of
> >>>>> database synchronization based on Fedora data. I have seen many
> >>>>> projects based on Fedora that employ a database alongside Fedora
> >>>>> in order to speed up the querying process. I therefore think this
> >>>>> might be useful for many.
> >>>>>
> >>>>> Of course the search interface that comes with Fedora may also be
> >>>>> extended to make use of this new feature, but since that is not a
> >>>>> need for our project at the moment I have not taken the time to
> >>>>> do so.
> >>>>>
> >>>>> I would be very pleased if this could become part of subsequent
> >>>>> Fedora releases. Hopefully others think so too.
> >>>>>
> >>>>> Kind regards,
> >>>>>
> >>>>> Lodewijk Bogaards
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> ------------------------------------------------------------------
> -----
> >>>> -------
> >>>> Come build with us! The BlackBerry(R) Developer Conference in SF,
> >>>> CA
> >>>> is the only developer event you need to attend this year.
> Jumpstart
> >>>> your
> >>>> developing skills, take BlackBerry mobile applications to market
> >>>> and
> >>>> stay
> >>>> ahead of the curve. Join us from November 9 - 12, 2009. Register
> >>>> now!
> >>>> http://p.sf.net/sfu/devconference
> >>>> _______________________________________________
> >>>> Fedora-commons-developers mailing list
> >>>> Fedora-commons-developers@lists.sourceforge.net
> >>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-
> >>>> developers
> >>>
> >>
> >>
> >
> 
> 


-------------------------------------------------------

Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische
Information mbH.
Registered office: Eggenstein-Leopoldshafen; Mannheim District Court, HRB
101892.
Managing Director: Sabine Brünger-Weilandt.
Chairman of the Supervisory Board: MinR Hermann Riehl.



