Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-31 Thread Robert Tansley
On 31/05/07, Larry Stone [EMAIL PROTECTED] wrote:

 ...hmm, it didn't _use_ to care what the path was at all; it would
 retrieve the bitstream referenced by the Sequence ID.  Now, at
 least on the 1.4.1 system I checked, both SID and path have to match.

 But it doesn't have to be implemented that way.  Since Sequence IDs
 are the ONLY Bitstream metadata which must be unique within an Item,
 the servlet might as well just ignore the path.

I made this change because ignoring the path was leading to a lot of
infinite URL spaces, e.g. because of uploaded PDFs that contained
relative links.  It also meant that a URL with a typo would still
retrieve the file but fail to work as expected on some platforms, e.g.
xxx/bitstream/12.34/56/1/photo.jpge would retrieve the file but,
because of the '.jpge', the platform wouldn't know what to do with it,
confusing the user.

A possible alternative is to send a 302 redirect to the correct path,
or to serve a smarter 404 page that links to the correct path and the
containing item.
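
As a rough illustration of the redirect alternative, here is a minimal
Python sketch (not DSpace code).  The in-memory BITSTREAMS table, the
resolve_bitstream function, the two-segment handle and the choice to
redirect to /handle/... when the bitstream is missing are all
assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the repository: (item handle, sequence ID) -> filename.
BITSTREAMS: Dict[Tuple[str, int], str] = {("12.34/56", 1): "photo.jpeg"}


def resolve_bitstream(path: str) -> Tuple[int, Optional[str]]:
    """Map a request path to (HTTP status, redirect location or None).

    Expected shape: /bitstream/<handle-prefix>/<handle-suffix>/<sequence-id>/<filename>
    (assumes the handle is exactly two path segments).
    """
    parts = path.strip("/").split("/", 4)
    if len(parts) != 5 or parts[0] != "bitstream":
        return 404, None
    try:
        sequence_id = int(parts[3])
    except ValueError:
        return 404, None

    handle = f"{parts[1]}/{parts[2]}"
    filename = BITSTREAMS.get((handle, sequence_id))
    if filename is None:
        # Bitstream is gone: degrade to the containing item (assumed URL form).
        return 302, f"/handle/{handle}"
    if parts[4] != filename:
        # Sequence ID matched but the trailing path did not: send the client
        # to the one correct URL instead of serving from a mistyped one.
        return 302, f"/bitstream/{handle}/{sequence_id}/{filename}"
    return 200, None
```

With this shape the sequence ID alone selects the bitstream, but only
one URL per bitstream ever answers 200, so relative links inside a PDF
cannot spawn an unbounded set of working URLs.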

Either way that URL is only 'semi-persistent' -- in 1.0 (and for MIT)
we decided not to give Handles to bitstreams (and that URL contains
the handle of the item so we can gracefully degrade to the item itself
if the bitstream has gone) though that should be a matter for local
policy.

Rob

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech


Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-30 Thread Larry Stone
 On Wed, May 30, 2007 at 02:01:50AM -0400, Larry Stone wrote:
  How about the word "resource" to introduce the URI, since it is, after
  all, a reference to a resource -- the R in URI.  It'd be:
 
   prefix/resource/encoded-URI  e.g.
 
   http://dspace.me.ac.uk/resource/hdl/1234/56
 
  This follows the proposal to encode the URI by tearing off the scheme
  and putting it in a separate pathname element to avoid issues over
  quoting the ":".  Note that I propose using the actual scheme label
  in the URL rather than a user-friendly label, e.g. "hdl" rather than
  "handle".

 This sounds like some reasonable middle-ground. The only issue I can see
 here is that this mechanism only allows us to refer to objects that have
 persistent identifiers. Of course, we could still use an internal form
 of identifier for objects without actual persistent identifiers, but
 then if we have an internal format, should we not use that everywhere?
 Aside from consistency, Mark made the observation that including the
 persistent identifier in the URL is, to a certain extent, bogus. Perhaps
 we could just provide the ability to resolve URLs of the above form, but
 for making links, etc, we use an internal identifier format.

That's a good point -- DSpace is taking on the function of resolving
persistent identifiers like Handles and DOIs when there is no need, since
Handles, at least, already have a Web proxy server.  I wasn't counting
on the "add our own flavor of PIDs to DSpace" question getting resolved
favorably.

It _does_ have to allow data model objects to be referenced (through Web
interfaces) by a URL that includes a _persistent_ identifier (as
opposed to, say, a database-ID).  That's the URL that will get used in
links and citations despite our best efforts to promote Handles, so it
needs to be reasonably permanent.

Given a DSpace-specific persistent identifier (e.g. the UUID scheme),
I see two options:

1. Give every content-model object a DSpace-type PID, no matter what.
   External references are URLs including the DSpace PID.
   Other PID schemes (e.g. Handle) resolve to those URLs.
   Allow plugins to register other PIDs when an object is created.

2. Make the DSpace PID into a PersistentIdentifier plugin so it is a
   peer with the Handle or DOI plugins.  The administrator chooses to
   support one or more, and the canonical external reference to an
   object becomes whichever kind of PID is configured to be canonical.

Choice (1) is simpler and seems more sensible, but (2) could be
completely backward-compatible.

Note that some ingested objects will already have PIDs, e.g. if they
are AIPs being re-ingested to reconstruct an archive after catastrophic
failure, or DIPs (AIPs) mirrored from another repository.  If _all_
DSpaces have the same PID scheme as in (1), there's no problem ingesting
and accessing another archive's objects.  Under (2), you could end up
generating new PIDs for old objects because your archive doesn't
understand the kind of PID they already have.
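
To make the re-ingest concern under (2) concrete, here is a minimal
Python sketch.  Everything in it is hypothetical: IdentifierRegistry,
its method names and the lambda "minter" are illustrative and are not
part of DSpace or of any prototype mentioned in this thread.

```python
from typing import Callable, Dict, Optional, Tuple

Minter = Callable[[int], str]  # internal object ID -> new identifier value


class IdentifierRegistry:
    """Hypothetical registry of PID plugins, keyed by scheme label (e.g. "hdl")."""

    def __init__(self, canonical: str) -> None:
        self.canonical = canonical  # scheme used for the canonical external reference
        self._minters: Dict[str, Minter] = {}
        self._by_pid: Dict[Tuple[str, str], int] = {}
        self._by_object: Dict[int, Dict[str, str]] = {}

    def register(self, scheme: str, minter: Minter) -> None:
        self._minters[scheme] = minter

    def assign(self, object_id: int,
               existing: Optional[Dict[str, str]] = None) -> Dict[str, str]:
        """Give the object one PID per configured scheme.

        PIDs the object arrived with are kept only if their scheme is
        configured; any other scheme is silently dropped, which is the
        risk described above for option (2).
        """
        carried = existing or {}
        pids: Dict[str, str] = {}
        for scheme, minter in self._minters.items():
            pids[scheme] = carried[scheme] if scheme in carried else minter(object_id)
        for scheme, value in pids.items():
            self._by_pid[(scheme, value)] = object_id
        self._by_object[object_id] = pids
        return pids

    def resolve(self, scheme: str, value: str) -> Optional[int]:
        return self._by_pid.get((scheme, value))

    def canonical_pid(self, object_id: int) -> str:
        return f"{self.canonical}:{self._by_object[object_id][self.canonical]}"


registry = IdentifierRegistry(canonical="hdl")
registry.register("hdl", lambda object_id: f"1234/{object_id}")
registry.assign(56)                                   # new object: Handle is minted
registry.assign(7, existing={"hdl": "9999/1",         # re-ingested object: Handle kept,
                             "doi": "10.1000/x"})     # DOI lost (no DOI plugin here)
```

Under option (1) the same object would also carry the shared DSpace
PID, so the archive could still address it even with no DOI plugin.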

I think the UUID scheme (or something like it) makes a whole lot of sense,
but it is a rather significant change.

  Re special characters and quoting: I agree with James' original point that
  the HTTP URL spec has quoting rules for just this reason, but from a
  practical point of view, the client and server implementations have a lot
  of bugs in this area.  That's what I discovered implementing WebDAV for
  the LNI: it wasn't worth trying to encode a slash (/) in a URL, e.g.
  within a Handle, because it would just get stomped on differently by
  the different clients.  Better to let it get used literally as a
  path element separator and make the servlet clever enough to figure it out.
  Also, construct the servlet's URL so the whole path after a certain
  point is part of the object URI, e.g. the Handle.

 Again, this sounds fine. The only reason this doesn't work with the
 current implementation with Handles is for referencing bitstreams -- we
 are forced to make assumptions about the structure of the persistent
 identifiers because we use the (arbitrary and unpredictable) filename as
 part of the URL. This must be avoided, whichever scheme we eventually
 use.

Do you mean the way Bitstreams are referenced in a /bitstream/ servlet
URL?  I thought the path actually doesn't matter there -- it can be
anything, the servlet only looks at the sequence ID, because the URL
follows the pattern:

 prefix/bitstream/handle/SequenceID/path
  
e.g.

 http://dspace.mit.edu/bitstream/1721.1/35700/2/60504128-MIT.pdf

...hmm, it didn't _use_ to care what the path was at all; it would
retrieve the bitstream referenced by the Sequence ID.  Now, at
least on the 1.4.1 system I checked, both SID and path have to match.

But it doesn't have to be implemented that way.  Since Sequence IDs
are the ONLY Bitstream metadata which must be unique within an Item,
the servlet might as well just ignore the path.

(Of course, this ignores the necessity of the /html/ servlet which

Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-29 Thread James Rutherford
On Fri, May 25, 2007 at 03:39:12PM -0500, Brad Teale wrote:
 How do you determine which PI system generates a PId (base it
 on collection, community)?  What if one PI system fails (URL
 unreachable, temporarily down) and it is needed to resolve the PId?
 Could it be possible to create a loop of PIds that resolve to different
 PI systems while moving through the PI system stack?

Something to note is that I don't anticipate objects normally having
more than one identifier. The prototype allows it, but objects will
still be assigned only one identifier (according to configuration --
the details of which are still undecided). What changes is that we can
now associate multiple identifiers with an object (the stack is there
to define which mechanisms we understand) and resolve them all to the
correct place.

 3) Including special characters in the URL string doesn't seem like a
 good idea.  While they are valid characters, it does take extra
 processing to encode/decode them from layer to layer.  Why not just
 leave the URL alone or change /handle to something like /uri, /id, or
 /pid?  Why encode the PI system into the URI?

As I mention on the wiki, my current idea is to have URLs of the form:

http://dspace.me.ac.uk/uri/hdl:1234/56

which will resolve to the object with Handle 1234/56, etc. If the
object also has a DOI with value 7890/12 then the following URL would
point to the object as well:

http://dspace.me.ac.uk/uri/doi:7890/12

It is necessary to include the hdl: and doi: parts so we can
distinguish between different persistent identifier mechanisms. The
values allowed for the persistent identifier are dependent on the
mechanism we are dealing with, and as far as possible this will be kept
simple.
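
A minimal Python sketch of how a URL of this form could be split into
mechanism and value.  The function name, the KNOWN_SCHEMES set and the
error handling are assumptions for illustration; this is not the
prototype's code.

```python
from typing import Tuple
from urllib.parse import unquote, urlsplit

# Assumed set of mechanisms the installation understands.
KNOWN_SCHEMES = {"hdl", "doi"}


def parse_identifier_url(url: str) -> Tuple[str, str]:
    """Split .../uri/<scheme>:<value> into (scheme, value).

    Everything after "/uri/" is treated as the identifier, so slashes
    inside the value (as in a Handle) are kept literally.
    """
    path = urlsplit(url).path
    marker = "/uri/"
    start = path.find(marker)
    if start == -1:
        raise ValueError(f"not an identifier URL: {url}")

    # Decode percent-escapes so both "hdl:1234/56" and "hdl%3A1234%2F56" work.
    identifier = unquote(path[start + len(marker):])
    scheme, separator, value = identifier.partition(":")
    if not separator or not value or scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unsupported identifier: {identifier}")
    return scheme, value


print(parse_identifier_url("http://dspace.me.ac.uk/uri/hdl:1234/56"))
print(parse_identifier_url("http://dspace.me.ac.uk/uri/doi:7890/12"))
```

Splitting on the first ":" only means the value itself may contain
further colons without confusing the parser.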

 As far as having a default PI system out of the box for DSpace, I would
 recommend using a local identifier schema which used the existing URLs.
  Include the Handle PI system in the release as a configurable option,
 but not turned on by default.  This would remove the fake handle being
 assigned to all objects and clean up the default URLs out of the box.

I've already experimented with a null identifier that can be used to
resolve to objects locally. For example, in my prototype, the following
URL would resolve to the Item with internal id 4:

http://dspace.me.ac.uk/uri/dsi:2/4

I'm still not convinced that this is a good idea, but it seems useful
and it makes accessing individual bitstreams a little more predictable
and consistent with the other objects.

cheers,

Jim

-- 
James Rutherford  |  Hewlett-Packard Limited registered Office:
Research Engineer |  Cain Road,
HP Labs   |  Bracknell,
Bristol, UK   |  Berks
+44 117 312 7066  |  RG12 1HN.
[EMAIL PROTECTED]   |  Registered No: 690597 England

The contents of this message and any attachments to it are confidential and
may be legally privileged. If you have received this message in error, you
should delete it from your system immediately and advise the sender. To any
recipient of this message within HP, unless otherwise stated you should
consider this message and attachments as HP CONFIDENTIAL.



Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-25 Thread Brad Teale
I looked through the Persistent Identifier (PI) wiki page and came up
with a few questions/comments.

1) You created the prototype with a stackable interface, something I
thought about doing, but now I've been wondering if it causes more
problems than it's worth.  Why would an institution use more than one PI
system?  How do you determine which PI system generates a PId (base it
on collection, community)?  What if one PI system fails (URL
unreachable, temporarily down) and it is needed to resolve the PId?
Could it be possible to create a loop of PIds that resolve to different
PI systems while moving through the PI system stack?

2)  It is mentioned that HTTP isn't persistent:  Could someone explain
why HTTP isn't as persistent as any other protocol?

3) Including special characters in the URL string doesn't seem like a
good idea.  While they are valid characters, it does take extra
processing to encode/decode them from layer to layer.  Why not just
leave the URL alone or change /handle to something like /uri, /id, or
/pid?  Why encode the PI system into the URI?

4) Assigning bitstreams persistent identifiers seems dangerous.  At the
very least, version control and a history function are required by the
application and PI system to determine if the PId is actually pointing
to what was requested.  Also, how are multiple bitstreams handled when
assigned to an item?  Does each bitstream get a PId?  How does a user
look at all bitstreams associated together by the item when the PId
references only a single bitstream?

As far as having a default PI system out of the box for DSpace, I would
recommend using a local identifier schema which used the existing URLs.
 Include the Handle PI system in the release as a configurable option,
but not turned on by default.  This would remove the fake handle being
assigned to all objects and clean up the default URLs out of the box.

--
Brad






Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-23 Thread James Rutherford
On Tue, May 22, 2007 at 02:15:48PM -0700, Han, Yan wrote:
 The wiki mentions that DOI is using http, which is not totally correct.

I know this. The list of persistent identifier mechanisms was only
supposed to give examples of what we could use. I don't intend to
actually build support for DOIs or any mechanism other than Handles
into DSpace; rather, my goal is to make it extremely simple for others
to do so where necessary.

The point of my email wasn't to find out which persistent identifier
mechanism DSpace should use by default, it was to gather opinion on how
we can make DSpace less dependent on one mechanism in a way that isn't
limiting.

cheers,

Jim



[Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-22 Thread James Rutherford
Hi all,

I've recently started looking into the way DSpace deals (or doesn't)
with persistent identifiers (prompted in part by patch #1690912 and a
conversation I had with Mark Diggory). I've put some thoughts on the
wiki:

http://wiki.dspace.org/index.php/PersistentIdentifiers

and I'd like to gather some input. I've already implemented everything
discussed on the wiki in a prototype, and it seems to be working well.
Note that the implementation is being done in parallel with the DAO
prototype:

http://wiki.dspace.org/index.php/DaoPrototype

The most controversial aspects that I've come up against are:

 * deciding which persistent identifier method is used (if more than one
   is supported); and
 * what the URLs should look like (http://dspace.me.ac.uk/uri/hdl:12/34
   rather than http://dspace.me.ac.uk/handle/12/34, for instance)


I'm particularly interested in hearing from folks who already need to
support other identifiers (PURLs, DOIs, etc), but any input would be
appreciated.

cheers,

Jim
