Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

Larry Stone Wed, 30 May 2007 22:35:46 -0700

> On Wed, May 30, 2007 at 02:01:50AM -0400, Larry Stone wrote:
> > How about the word "resource" to introduce the URI, since it is, after
> > all, a reference to a resource -- the "R" in URI.  It'd be:
> >
> >  <prefix>/resource/<encoded-URI>  e.g.
> >
> >  http://dspace.me.ac.uk/resource/hdl/1234/56
> >
> > This follows the proposal to encode the URI by tearing off the scheme
> > and putting it in a separate pathname element to avoid issues over
> > quoting the ":".  Note that I propose using the actual scheme label
> > in the URL rather than a user-friendly label, e.g. "hdl" rather than
> > "handle".
>
> This sounds like some reasonable middle-ground. The only issue I can see
> here is that this mechanism only allows us to refer to objects that have
> persistent identifiers. Of course, we could still use an "internal" form
> of identifier for objects without actual persistent identifiers, but
> then if we have an internal format, should we not use that everywhere?
> Aside from consistency, Mark made the observation that including the
> persistent identifier in the URL is, to a certain extent, bogus. Perhaps
> we could just provide the ability to resolve URLs of the above form, but
> for making links, etc, we use an internal identifier format.
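
For reference, the quoted /resource/<scheme>/<rest> encoding might look
roughly like the sketch below.  ResourceUrls and both of its methods are
hypothetical names made up for illustration, not existing DSpace code, and
the sketch assumes the identifier has the plain "scheme:rest" shape.

```java
// Hypothetical helper, not part of DSpace: moves the URI scheme into its
// own path element so the ":" never has to be quoted in the URL.
final class ResourceUrls {
    private static final String MARKER = "/resource/";

    private ResourceUrls() {}

    /** Builds <prefix>/resource/<scheme>/<rest> from an identifier like "hdl:1234/56". */
    static String toResourceUrl(String prefix, String persistentId) {
        int colon = persistentId.indexOf(':');
        if (colon <= 0 || colon == persistentId.length() - 1) {
            throw new IllegalArgumentException("not a scheme:rest identifier: " + persistentId);
        }
        String scheme = persistentId.substring(0, colon);
        String rest = persistentId.substring(colon + 1);
        String base = prefix.endsWith("/") ? prefix.substring(0, prefix.length() - 1) : prefix;
        return base + MARKER + scheme + "/" + rest;
    }

    /** Reverses the encoding for a servlet path such as "/resource/hdl/1234/56". */
    static String toPersistentId(String path) {
        if (!path.startsWith(MARKER)) {
            throw new IllegalArgumentException("not a resource path: " + path);
        }
        String tail = path.substring(MARKER.length());
        int slash = tail.indexOf('/');
        if (slash <= 0 || slash == tail.length() - 1) {
            throw new IllegalArgumentException("missing scheme or identifier: " + path);
        }
        // The first path element is the scheme; everything after it is the identifier.
        return tail.substring(0, slash) + ":" + tail.substring(slash + 1);
    }
}
```

The slashes inside the Handle are left literal, so the servlet has to treat
everything after the scheme element as the identifier.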


That's a good point -- DSpace is taking on the function of resolving
persistent identifiers like Handles and DOIs when there is no need, since
Handles, at least, already have a Web proxy server.  I wasn't counting
on the "add our own flavor of PIDs to DSpace" question getting resolved
favorably.

DSpace _does_ have to allow data model objects to be referenced (through
Web interfaces) by a URL that includes a _persistent_ identifier (as
opposed to, say, a database-ID).  That's the URL that will get used in
links and citations despite our best efforts to promote Handles, so it
needs to be reasonably permanent.

Given a DSpace-specific persistent identifier (e.g. the UUID scheme),
I see two options:

1. Give every content-model object a DSpace-type PID, no matter what.
   External references are URLs including the DSpace PID.
   Other PID schemes (e.g. Handle) resolve to those URLs.
   Allow plugins to register other PIDs when an object is created.
    
2. Make the "DSpace" PID into a PersistentIdentifier plugin so it is a
   peer with the Handle or DOI plugins.  The administrator chooses to
   support one or more, and the canonical external reference to an
   object becomes whichever kind of PID is configured to be canonical.

Choice (1) is simpler and seems more sensible, but (2) could be
completely backward-compatible.

Note that some ingested objects will already have PIDs, e.g. if they
are AIPs being re-ingested to reconstruct an archive after catastrophic
failure, or DIPs (AIPs) mirrored from another repository.  If _all_
DSpaces have the same PID scheme as in (1), there's no problem ingesting
and accessing another archive's objects.  Under (2), you could end up
generating new PIDs for old objects because your archive doesn't
understand the kind of PID they already have.
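
Here is a rough sketch of that ingest problem, just to make it concrete.
Everything in it is an assumption for illustration: IngestPidPolicy is a
made-up class, PIDs are assumed to look like "scheme:value", and the
"minting" is only a random UUID standing in for whatever the canonical
plugin would really do.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch; none of these names exist in DSpace.
final class IngestPidPolicy {
    private final String canonicalScheme;
    private final Set<String> supportedSchemes;

    IngestPidPolicy(String canonicalScheme, Collection<String> otherSchemes) {
        this.canonicalScheme = canonicalScheme;
        this.supportedSchemes = new HashSet<>(otherSchemes);
        // The canonical scheme is always understood by this archive.
        this.supportedSchemes.add(canonicalScheme);
    }

    /** Returns the scheme label of "scheme:value", or "" if there is none. */
    static String schemeOf(String pid) {
        int colon = pid.indexOf(':');
        return colon > 0 ? pid.substring(0, colon) : "";
    }

    /** Returns the PID an ingested object ends up with. */
    String pidAfterIngest(String existingPid) {
        if (existingPid != null && supportedSchemes.contains(schemeOf(existingPid))) {
            return existingPid; // the archive understands it, so keep it
        }
        // Placeholder minting: a real plugin would register the new PID somewhere.
        return canonicalScheme + ":" + UUID.randomUUID();
    }
}
```

If every DSpace understands the same scheme, as in (1), the first branch is
always taken for objects coming from another DSpace.  Under (2), an archive
configured for DOIs only would fall through and mint a new PID for an
object that arrives with a Handle.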

I think the UUID scheme (or something like it) makes a whole lot of sense,
but it is a rather significant change.

> > Re special characters and quoting: I agree with James' original point that
> > the HTTP URL spec has quoting rules for just this reason, but from a
> > practical point of view, the client and server implementations have a lot
> > of bugs in this area.  That's what I discovered implementing WebDAV for
> > the LNI: it wasn't worth trying to encode a slash (/) in a URL, e.g.
> > within a Handle, because it would just get stomped on differently by
> > the different clients.  Better to let it get used literally as a
> > path element separator and make the servlet clever enough to figure it out.
> > Also, construct the servlet's URL so the whole path after a certain
> > point is part of the object URI, e.g. the Handle.
>
> Again, this sounds fine. The only reason this doesn't work with the
> current implementation with Handles is for referencing bitstreams -- we
> are forced to make assumptions about the structure of the persistent
> identifiers because we use the (arbitrary and unpredictable) filename as
> part of the URL. This must be avoided, whichever scheme we eventually
> use.

Do you mean the way Bitstreams are referenced in a "/bitstream/" servlet
URL?  I thought the path actually doesn't matter there -- it can be
anything, the servlet only looks at the sequence ID, because the URL
follows the pattern:

 <prefix>/bitstream/<handle>/<SequenceID>/<path>
  
e.g.

 http://dspace.mit.edu/bitstream/1721.1/35700/2/60504128-MIT.pdf

...hmm, it didn't _use_ to care what the path was at all, it would
retrieve the bitstream referenced by the Sequence ID.  Now, at
least on the 1.4.1 system I checked, both SID and path have to match.

But it doesn't have to be implemented that way.  Since Sequence IDs
are the ONLY Bitstream metadata which must be unique within an Item,
the servlet might as well just ignore the path.
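
A minimal sketch of a parser that behaves that way is below.  BitstreamRef
is a hypothetical class, not the actual servlet code, and it bakes in the
assumption that a Handle is exactly two path elements (prefix/suffix),
which is exactly the kind of assumption about identifier structure being
complained about above.

```java
// Hypothetical sketch of parsing <handle>/<SequenceID>/<path>, i.e. the part
// of the URL after "/bitstream/".  Not the real DSpace servlet.
final class BitstreamRef {
    final String handle;
    final int sequenceId;

    BitstreamRef(String handle, int sequenceId) {
        this.handle = handle;
        this.sequenceId = sequenceId;
    }

    static BitstreamRef parse(String pathInfo) {
        String trimmed = pathInfo.startsWith("/") ? pathInfo.substring(1) : pathInfo;
        // Assumption: the Handle is two elements; the fourth piece holds the
        // whole trailing path, however many slashes it contains.
        String[] seg = trimmed.split("/", 4);
        if (seg.length < 3) {
            throw new IllegalArgumentException("expected <handle>/<SequenceID>[/<path>]: " + pathInfo);
        }
        int seq;
        try {
            seq = Integer.parseInt(seg[2]);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("sequence ID is not a number: " + seg[2], e);
        }
        // The trailing path (seg[3], if present) is deliberately ignored.
        return new BitstreamRef(seg[0] + "/" + seg[1], seq);
    }
}
```

With this, /1721.1/35700/2/60504128-MIT.pdf and /1721.1/35700/2/whatever.pdf
both resolve to sequence ID 2 of the same Item.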

(Of course, this ignores the necessity of the "/html/" servlet, which
_only_ uses the path to locate Bitstreams, leaving out the Sequence ID
because it has to satisfy relative link paths within the bitstreams
of an archived website.)

So, if we allow literal persistent IDs to be stuffed into URLs,
I guess the PersistentIdentifier plugin needs a method to help parse
the URL and return the segment of path after the persistent ID, so
we can follow a general rule for encoding bitstream references.
The other option would be to have the PID-type-specific method
interpret the entire rest of the path itself, returning the
indicated content model object, possibly a Bitstream.
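
The first of those two options could be sketched as below.  The interface
and class names (PidUrlParser, HandleUrlParser, PidUrlDispatcher,
ParsedPid) are all hypothetical, and the Handle parser again assumes a
two-element prefix/suffix Handle; other schemes would supply their own
rule for where the identifier ends.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical result type: the identifier plus whatever path follows it.
final class ParsedPid {
    final String pid;
    final String remainder;

    ParsedPid(String pid, String remainder) {
        this.pid = pid;
        this.remainder = remainder;
    }
}

// Hypothetical plugin method: each PID scheme knows where its own ID ends.
interface PidUrlParser {
    ParsedPid parse(String pathAfterScheme);
}

final class HandleUrlParser implements PidUrlParser {
    @Override
    public ParsedPid parse(String pathAfterScheme) {
        // Assumption: a Handle is <prefix>/<suffix>; the rest is the remainder.
        String[] seg = pathAfterScheme.split("/", 3);
        if (seg.length < 2 || seg[0].isEmpty() || seg[1].isEmpty()) {
            throw new IllegalArgumentException("not a Handle path: " + pathAfterScheme);
        }
        String remainder = seg.length == 3 ? seg[2] : "";
        return new ParsedPid("hdl:" + seg[0] + "/" + seg[1], remainder);
    }
}

final class PidUrlDispatcher {
    private final Map<String, PidUrlParser> parsers = new HashMap<>();

    void register(String scheme, PidUrlParser parser) {
        parsers.put(scheme, parser);
    }

    /** Parses a path such as "hdl/1721.1/35700/2/60504128-MIT.pdf". */
    ParsedPid parse(String path) {
        int slash = path.indexOf('/');
        if (slash <= 0) {
            throw new IllegalArgumentException("missing scheme element: " + path);
        }
        String scheme = path.substring(0, slash);
        PidUrlParser parser = parsers.get(scheme);
        if (parser == null) {
            throw new IllegalArgumentException("no plugin for scheme: " + scheme);
        }
        return parser.parse(path.substring(slash + 1));
    }
}
```

The caller would then apply the general bitstream rule (Sequence ID first,
then an ignorable path) to the remainder, whatever kind of PID came
before it.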

    -- Larry


