Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-31 Thread Robert Tansley
On 31/05/07, Larry Stone <[EMAIL PROTECTED]> wrote:

> ...hmm, it didn't _use_ to care what the path was at all, it would
> retrieve the bitstream referenced by the Sequence ID.  Now, at
> least on the 1.4.1 system I checked, both SID and path have to match.
>
> But it doesn't have to be implemented that way.  Since Sequence IDs
> are the ONLY Bitstream metadata which must be unique within an Item,
> the servlet might as well just ignore the path.

I made this change because ignoring the path was leading to a lot of
infinite URL spaces, e.g. because of uploaded PDFs that contained
relative links. It also meant that URLs with typos could cause the file
to fail to work as expected on some platforms, e.g.
xxx/bitstream/12.34/56/1/photo.jpge would retrieve the file but
because of the '.jpge' the platform wouldn't know what to do with the
file, confusing the user.

A possible alternative is to send a 302 redirect to the correct path,
or to serve a smarter 404 page that links to the correct path and the
containing item.
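
A minimal sketch of the redirect alternative, not the actual servlet
code. It assumes a Handle has exactly one slash (prefix/suffix), that the
item page lives under /handle/ as in the current URLs, and it uses a
hypothetical in-memory table (BITSTREAMS) in place of the database. The
stored filename "photo.jpeg" is made up for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the database:
# (item handle, sequence ID) -> stored bitstream filename.
BITSTREAMS: Dict[Tuple[str, int], str] = {
    ("12.34/56", 1): "photo.jpeg",
}


def resolve_bitstream(path: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, location) for a /bitstream/ request path.

    200 with no location means "serve the file".
    302 carries the canonical bitstream path to redirect to.
    404 carries the item path that the error page could link to.
    """
    parts = path.strip("/").split("/")
    # Expected: bitstream / handle prefix / handle suffix / sequence ID / filename...
    if len(parts) < 5 or parts[0] != "bitstream":
        return 404, None
    handle = parts[1] + "/" + parts[2]
    item_path = "/handle/" + handle
    try:
        seq_id = int(parts[3])
    except ValueError:
        return 404, item_path
    stored = BITSTREAMS.get((handle, seq_id))
    if stored is None:
        # Bitstream is gone: degrade to the containing item.
        return 404, item_path
    requested = "/".join(parts[4:])
    if requested == stored:
        return 200, None
    # Sequence ID matches but the filename does not: point at the one true URL.
    return 302, "/bitstream/%s/%d/%s" % (handle, seq_id, stored)
```

With this shape, a typo such as photo.jpge or a relative link followed
from inside a PDF ends up at the single canonical URL instead of creating
a new one, so the URL space stays finite.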

Either way that URL is only 'semi-persistent' -- in 1.0 (and for MIT)
we decided not to give Handles to bitstreams (and that URL contains
the handle of the item so we can gracefully degrade to the item itself
if the bitstream has gone) though that should be a matter for local
policy.

Rob

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech


Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-30 Thread Larry Stone
> On Wed, May 30, 2007 at 02:01:50AM -0400, Larry Stone wrote:
> > How about the word "resource" to introduce the URI, since it is, after
> > all, a reference to a resource -- the "R" in URI.  It'd be:
> >
> >  /resource/  e.g.
> >
> >  http://dspace.me.ac.uk/resource/hdl/1234/56
> >
> > This follows the proposal to encode the URI by tearing off the scheme
> > and putting it in a separate pathname element to avoid issues over
> > quoting the ":".  Note that I propose using the actual scheme label
> > in the URL rather than a user-friendly label, e.g. "hdl" rather than
> > "handle".
>
> This sounds like some reasonable middle-ground. The only issue I can see
> here is that this mechanism only allows us to refer to objects that have
> persistent identifiers. Of course, we could still use an "internal" form
> of identifier for objects without actual persistent identifiers, but
> then if we have an internal format, should we not use that everywhere?
> Aside from consistency, Mark made the observation that including the
> persistent identifier in the URL is, to a certain extent, bogus. Perhaps
> we could just provide the ability to resolve URLs of the above form, but
> for making links, etc, we use an internal identifier format.

That's a good point -- DSpace is taking on the function of resolving
persistent identifiers like Handles and DOIs when there is no need, since
Handles, at least, already have a Web proxy server.  I wasn't counting
on the "add our own flavor of PIDs to DSpace" getting resolved favorably.

It _does_ have to allow data model objects to be referenced (through Web
interfaces) by an URL that includes a _persistent_ identifier (as
opposed to, say, a database-ID).  That's the URL that will get used in
links and citations despite our best efforts to promote Handles, so it
needs to be reasonably permanent.

Given a DSpace-specific persistent identifier (e.g. the UUID scheme),
I see two options:

1. Give every content-model object a DSpace-type PID, no matter what.
   External references are URLs including the DSpace PID.
   Other PID schemes (e.g. Handle) resolve to those URLs.
   Allow plugins to register other PIDs when an object is created.

2. Make the "DSpace" PID into a PersistentIdentifier plugin so it is a
   peer with the Handle or DOI plugins.  The administrator chooses to
   support one or more, and the canonical external reference to an
   object becomes whichever kind of PID is configured to be canonical.

Choice (1) is simpler and seems more sensible, but (2) could be
completely backward-compatible.

Note that some ingested objects will already have PIDs, e.g. if they
are AIPs being re-ingested to reconstruct an archive after catastrophic
failure, or DIPs (AIPs) mirrored from another repository.  If _all_
DSpaces have the same PID scheme as in (1), there's no problem ingesting
and accessing another archive's objects.  Under (2), you could end up
generating new PIDs for old objects because your archive doesn't
understand the kind of PID they already have.
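
To make the ingest concern under (2) concrete, here is a rough sketch. The
class PidRegistry, its methods, and the minter with the made-up "1234"
prefix are all hypothetical names for illustration, not an existing DSpace
API. PIDs are written as "scheme:value".

```python
import itertools
from typing import Callable, Dict, Optional


class PidRegistry:
    """Option (2): PID schemes are peer plugins, one configured as canonical."""

    def __init__(self, canonical: str) -> None:
        self.canonical = canonical
        self.minters: Dict[str, Callable[[], str]] = {}

    def register(self, scheme: str, minter: Callable[[], str]) -> None:
        """Enable a PID scheme by registering a function that mints new PIDs."""
        self.minters[scheme] = minter

    def pid_for_ingest(self, existing_pid: Optional[str]) -> str:
        """Keep the object's PID if its scheme is enabled, else mint a new one."""
        if existing_pid is not None:
            scheme = existing_pid.partition(":")[0]
            if scheme in self.minters:
                return existing_pid
        # Unknown scheme or no PID at all: fall back to the canonical scheme.
        return self.minters[self.canonical]()


_counter = itertools.count(1)


def mint_handle() -> str:
    # Hypothetical minter; "1234" is a made-up Handle prefix.
    return "hdl:1234/%d" % next(_counter)


# An archive configured with Handles only.
registry = PidRegistry(canonical="hdl")
registry.register("hdl", mint_handle)
```

In this sketch an object arriving with a Handle keeps it, while one
arriving with a DOI gets a freshly minted Handle because the archive has
no DOI plugin. Under (1) every archive would understand the shared scheme
and the fallback branch would not be needed for mirrored objects.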

I think the UUID scheme (or something like it) makes a whole lot of sense,
but it is a rather significant change.

> > Re special characters and quoting: I agree with James' original point that
> > the HTTP URL spec has quoting rules for just this reason, but from a
> > practical point of view, the client and server implementations have a lot
> > of bugs in this area.  That's what I discovered implementing WebDAV for
> > the LNI: it wasn't worth trying to encode a slash (/) in a URL, e.g.
> > within a Handle, because it would just get stomped on differently by
> > the different clients.  Better to let it get used literally as a
> > path element separator and make the servlet clever enough to figure it out.
> > Also, construct the servlet's URL so the whole path after a certain
> > point is part of the object URI, e.g. the Handle.
>
> Again, this sounds fine. The only reason this doesn't work with the
> current implementation with Handles is for referencing bitstreams -- we
> are forced to make assumptions about the structure of the persistent
> identifiers because we use the (arbitrary and unpredictable) filename as
> part of the URL. This must be avoided, whichever scheme we eventually
> use.

Do you mean the way Bitstreams are referenced in a "/bitstream/" servlet
URL?  I thought the path actually doesn't matter there -- it can be
anything, the servlet only looks at the sequence ID, because the URL
follows the pattern:

 /bitstream///
  
e.g.

 http://dspace.mit.edu/bitstream/1721.1/35700/2/60504128-MIT.pdf

...hmm, it didn't _use_ to care what the path was at all, it would
retrieve the bitstream referenced by the Sequence ID.  Now, at
least on the 1.4.1 system I checked, both SID and path have to match.

But it doesn't have to be implemented that way.  Since Sequence IDs
are the ONLY Bitstream metadata which must be unique within an Item,
the servlet might as well just ignore the path.

(Of course, this ignores the necessi

Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-30 Thread James Rutherford
On Wed, May 30, 2007 at 02:01:50AM -0400, Larry Stone wrote:
> How about the word "resource" to introduce the URI, since it is, after
> all, a reference to a resource -- the "R" in URI.  It'd be:
> 
>  /resource/  e.g.
> 
>  http://dspace.me.ac.uk/resource/hdl/1234/56
>
> This follows the proposal to encode the URI by tearing off the scheme
> and putting it in a separate pathname element to avoid issues over
> quoting the ":".  Note that I propose using the actual scheme label
> in the URL rather than a user-friendly label, e.g. "hdl" rather than
> "handle".

This sounds like some reasonable middle-ground. The only issue I can see
here is that this mechanism only allows us to refer to objects that have
persistent identifiers. Of course, we could still use an "internal" form
of identifier for objects without actual persistent identifiers, but
then if we have an internal format, should we not use that everywhere?
Aside from consistency, Mark made the observation that including the
persistent identifier in the URL is, to a certain extent, bogus. Perhaps
we could just provide the ability to resolve URLs of the above form, but
for making links, etc, we use an internal identifier format.

> Re special characters and quoting: I agree with James' original point that
> the HTTP URL spec has quoting rules for just this reason, but from a
> practical point of view, the client and server implementations have a lot
> of bugs in this area.  That's what I discovered implementing WebDAV for
> the LNI: it wasn't worth trying to encode a slash (/) in a URL, e.g.
> within a Handle, because it would just get stomped on differently by
> the different clients.  Better to let it get used literally as a
> path element separator and make the servlet clever enough to figure it out.
> Also, construct the servlet's URL so the whole path after a certain
> point is part of the object URI, e.g. the Handle.

Again, this sounds fine. The only reason this doesn't work with the
current implementation with Handles is for referencing bitstreams -- we
are forced to make assumptions about the structure of the persistent
identifiers because we use the (arbitrary and unpredictable) filename as
part of the URL. This must be avoided, whichever scheme we eventually
use.

cheers,

Jim

-- 
James Rutherford  |  Hewlett-Packard Limited registered Office:
Research Engineer |  Cain Road,
HP Labs           |  Bracknell,
Bristol, UK       |  Berks
+44 117 312 7066  |  RG12 1HN.
[EMAIL PROTECTED] |  Registered No: 690597 England

The contents of this message and any attachments to it are confidential and
may be legally privileged. If you have received this message in error, you
should delete it from your system immediately and advise the sender. To any
recipient of this message within HP, unless otherwise stated you should
consider this message and attachments as "HP CONFIDENTIAL".



Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-29 Thread Larry Stone
About the URLs to access objects, it's really a matter for each
DSpace UI webapp to implement, isn't it?  I think it's important to
include a pathname component to introduce the encoded(*) object-URI,
however, just to protect against namespace collisions -- e.g. what if
the UI design had one of its own URLs starting with "doi"?

How about the word "resource" to introduce the URI, since it is, after
all, a reference to a resource -- the "R" in URI.  It'd be:

 /resource/  e.g.

 http://dspace.me.ac.uk/resource/hdl/1234/56

This follows the proposal to encode the URI by tearing off the scheme
and putting it in a separate pathname element to avoid issues over
quoting the ":".  Note that I propose using the actual scheme label
in the URL rather than a user-friendly label, e.g. "hdl" rather than "handle".

Re special characters and quoting: I agree with James' original point that
the HTTP URL spec has quoting rules for just this reason, but from a
practical point of view, the client and server implementations have a lot
of bugs in this area.  That's what I discovered implementing WebDAV for
the LNI: it wasn't worth trying to encode a slash (/) in a URL, e.g.
within a Handle, because it would just get stomped on differently by
the different clients.  Better to let it get used literally as a
path element separator and make the servlet clever enough to figure it out.
Also, construct the servlet's URL so the whole path after a certain
point is part of the object URI, e.g. the Handle.

Also note that the transformation from object URI to the actionable
URL is reversible -- we can pull the URI's scheme and path right out
of the URL and put it back together unambiguously.  I think it's
essential to have one unambiguous transformation for all URIs.
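
As a minimal sketch of that reversible mapping (the function names are
made up for illustration, and percent-encoding is deliberately left out,
since slashes are used literally as path separators):

```python
from urllib.parse import urlsplit

BASE = "http://dspace.me.ac.uk"  # example host used in this thread
PREFIX = "/resource/"            # pathname component that introduces the URI


def uri_to_url(uri: str, base: str = BASE) -> str:
    """Turn an object URI such as 'hdl:1234/56' into an actionable URL.

    The scheme is torn off and becomes its own path element, so the ':'
    never appears in the URL; slashes in the rest are kept literally.
    """
    scheme, sep, rest = uri.partition(":")
    if not (sep and scheme and rest):
        raise ValueError("not a scheme:path URI: %r" % uri)
    return base + PREFIX + scheme + "/" + rest


def url_to_uri(url: str) -> str:
    """Recover the object URI from a URL produced by uri_to_url."""
    path = urlsplit(url).path
    if not path.startswith(PREFIX):
        raise ValueError("not a resource URL: %r" % url)
    # First element after the prefix is the scheme; the whole remainder
    # of the path is the scheme-specific part, e.g. the Handle.
    scheme, sep, rest = path[len(PREFIX):].partition("/")
    if not (sep and scheme and rest):
        raise ValueError("missing scheme or path: %r" % url)
    return scheme + ":" + rest
```

For example, uri_to_url("hdl:1234/56") gives
http://dspace.me.ac.uk/resource/hdl/1234/56, and url_to_uri of that URL
gives back hdl:1234/56.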

> > 3) Including special characters in the URL string doesn't seem like a
> > good idea.  While they are valid characters, it does take extra
> > processing to encode/decode them from layer to layer.  Why not just
> > leave the URL alone or change /handle to something like /uri, /id, or
> > /pid?  Why encode the PI system into the URI?
>
> As I mention on the wiki, my current idea is to have URLs of the form:
>
> http://dspace.me.ac.uk/uri/hdl:1234/56
>
> which will resolve to the object with Handle 1234/56, etc. If the
> object also has a DOI with value 7890/12 then the following URL would
> point to the object as well:
>
> http://dspace.me.ac.uk/uri/doi:7890/12
>
> It is necessary to include the "hdl:" and "doi:" parts so we can
> distinguish between different persistent identifier mechanisms. The
> values allowed for the persistent identifier are dependent on the
> mechanism we are dealing with, and as far as possible this will be kept
> simple.

-- Larry




Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-29 Thread James Rutherford
On Fri, May 25, 2007 at 03:39:12PM -0500, Brad Teale wrote:
> How do you determine which PI system generates a PId (base it
> on collection, community)?  What if one PI system fails (URL
> unreachable, temporarily down) and it is needed to resolve the PId?
> Could it be possible to create a loop of PIds that resolve to different
> PI systems while moving through the PI system stack?

Something to note is that I don't anticipate objects normally having
more than one identifier. While the prototype allows this, it will still
be the case that objects are only assigned one identifier (according to
configuration -- the details of which are still undecided) but now we
are able to associate multiple identifiers with objects (the stack is
there to define what we understand), and resolve them all to the correct
place.

> 3) Including special characters in the URL string doesn't seem like a
> good idea.  While they are valid characters, it does take extra
> processing to encode/decode them from layer to layer.  Why not just
> leave the URL alone or change /handle to something like /uri, /id, or
> /pid?  Why encode the PI system into the URI?

As I mention on the wiki, my current idea is to have URLs of the form:

http://dspace.me.ac.uk/uri/hdl:1234/56

which will resolve to the object with Handle 1234/56, etc. If the
object also has a DOI with value 7890/12 then the following URL would
point to the object as well:

http://dspace.me.ac.uk/uri/doi:7890/12

It is necessary to include the "hdl:" and "doi:" parts so we can
distinguish between different persistent identifier mechanisms. The
values allowed for the persistent identifier are dependent on the
mechanism we are dealing with, and as far as possible this will be kept
simple.
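
A rough illustration of that dispatch, not the prototype's actual code:
the RESOLVERS table and the "item-A" object key below are hypothetical
stand-ins for the per-mechanism lookups.

```python
from typing import Dict, Optional

# Hypothetical lookup tables, one per persistent identifier mechanism.
# Both identifiers below belong to the same (made-up) object "item-A".
RESOLVERS: Dict[str, Dict[str, str]] = {
    "hdl": {"1234/56": "item-A"},
    "doi": {"7890/12": "item-A"},
}


def resolve(path: str) -> Optional[str]:
    """Map a request path like '/uri/hdl:1234/56' to an object key, or None."""
    prefix = "/uri/"
    if not path.startswith(prefix):
        return None
    # The part before the first ':' names the mechanism; the rest is its
    # value, whose allowed form depends on that mechanism.
    scheme, sep, value = path[len(prefix):].partition(":")
    if not sep:
        return None
    table = RESOLVERS.get(scheme)
    if table is None:
        return None  # a mechanism we do not understand
    return table.get(value)
```

Here /uri/hdl:1234/56 and /uri/doi:7890/12 both resolve to the same
object, while a path with an unknown prefix resolves to nothing.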

> As far as having a default PI system out of the box for Dspace, I would
> recommend using a local identifier schema which used the existing URLs.
>  Include the Handle PI system in the release as a configurable option,
> but not turned on by default.  This would remove the fake handle being
> assigned to all objects and clean up the default URLs out of the box.

I've already experimented with a "null" identifier that can be used to
resolve to objects locally. For example, in my prototype, the following
url would resolve to the Item with internal id 4:

http://dspace.me.ac.uk/uri/dsi:2/4

I'm still not convinced that this is a good idea, but it seems useful
and it makes accessing individual bitstreams a little more predictable
and consistent with the other objects.

cheers,

Jim

-- 
James Rutherford  |  Hewlett-Packard Limited registered Office:
Research Engineer |  Cain Road,
HP Labs           |  Bracknell,
Bristol, UK       |  Berks
+44 117 312 7066  |  RG12 1HN.
[EMAIL PROTECTED] |  Registered No: 690597 England



Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-25 Thread Brad Teale
I looked through the Persistent Identifier (PI) wiki page and came up
with a few questions/comments.

1) You created the prototype with a stackable interface, something I
thought about doing, but now I've been wondering if it causes more
problems than it's worth.  Why would an institution use more than one PI
system?  How do you determine which PI system generates a PId (base it
on collection, community)?  What if one PI system fails (URL
unreachable, temporarily down) and it is needed to resolve the PId?
Could it be possible to create a loop of PIds that resolve to different
PI systems while moving through the PI system stack?

2) It is mentioned that HTTP isn't "persistent":  Could someone explain
why HTTP isn't as persistent as any other protocol?

3) Including special characters in the URL string doesn't seem like a
good idea.  While they are valid characters, it does take extra
processing to encode/decode them from layer to layer.  Why not just
leave the URL alone or change /handle to something like /uri, /id, or
/pid?  Why encode the PI system into the URI?

4) Assigning bitstreams persistent identifiers seems dangerous.  At the
very least, version control and a history function are required by the
application and PI system to determine if the PId is actually pointing
to what was requested.  Also, how are multiple bitstreams handled when
assigned to an item?  Does each bitstream get a PId?  How does a user
look at all bitstreams associated together by the item when the PId
references only a single bitstream?

As far as having a default PI system out of the box for Dspace, I would
recommend using a local identifier schema which used the existing URLs.
Include the Handle PI system in the release as a configurable option,
but not turned on by default.  This would remove the fake handle being
assigned to all objects and clean up the default URLs out of the box.

--
Brad



On 05/22/2007 05:06 AM, James Rutherford wrote:
> Hi all,
> 
> I've recently started looking into the way DSpace deals (or doesn't)
> with persistent identifiers (prompted in part by patch #1690912 and a
> conversation I had with Mark Diggory). I've put some thoughts on the
> wiki:
> 
> http://wiki.dspace.org/index.php/PersistentIdentifiers
> 
> and I'd like to gather some input. I've already implemented everything
> discussed on the wiki in a prototype, and it seems to be working well.
> Note that the implementation is being done in parallel with the DAO
> prototype:
> 
> http://wiki.dspace.org/index.php/DaoPrototype
> 
> The most controversial aspects that I've come up against are:
> 
>  * deciding which persistent identifier method is used (if more than one
>    is supported); and
>  * what the URLs should look like (http://dspace.me.ac.uk/uri/hdl:12/34
>    rather than http://dspace.me.ac.uk/handle/12/34, for instance)
> 
> 
> I'm particularly interested in hearing from folks who already need to
> support other identifiers (PURLs, DOIs, etc), but any input would be
> appreciated.
> 
> cheers,
> 
> Jim
> 



Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-23 Thread James Rutherford
On Tue, May 22, 2007 at 02:15:48PM -0700, Han, Yan wrote:
> The wiki mentions that DOI is using http, which is not totally correct.

I know this. The list of persistent identifier mechanisms was only
supposed to give examples of what we could use. I don't intend to
actually build support for DOIs or any mechanism other than Handles into
DSpace; rather, my goal is to make it extremely simple for others to do
so where necessary.

The point of my email wasn't to find out which persistent identifier
mechanism DSpace should use by default, it was to gather opinion on how
we can make DSpace less dependent on one mechanism in a way that isn't
limiting.

cheers,

Jim

-- 
James Rutherford  |  Hewlett-Packard Limited registered Office:
Research Engineer |  Cain Road,
HP Labs           |  Bracknell,
Bristol, UK       |  Berks
+44 117 312 7066  |  RG12 1HN.
[EMAIL PROTECTED] |  Registered No: 690597 England



[Dspace-tech] Persistent identifiers in DSpace -- thoughts please

2007-05-22 Thread James Rutherford
Hi all,

I've recently started looking into the way DSpace deals (or doesn't)
with persistent identifiers (prompted in part by patch #1690912 and a
conversation I had with Mark Diggory). I've put some thoughts on the
wiki:

http://wiki.dspace.org/index.php/PersistentIdentifiers

and I'd like to gather some input. I've already implemented everything
discussed on the wiki in a prototype, and it seems to be working well.
Note that the implementation is being done in parallel with the DAO
prototype:

http://wiki.dspace.org/index.php/DaoPrototype

The most controversial aspects that I've come up against are:

 * deciding which persistent identifier method is used (if more than one
   is supported); and
 * what the URLs should look like (http://dspace.me.ac.uk/uri/hdl:12/34
   rather than http://dspace.me.ac.uk/handle/12/34, for instance)


I'm particularly interested in hearing from folks who already need to
support other identifiers (PURLs, DOIs, etc), but any input would be
appreciated.

cheers,

Jim

-- 
James Rutherford  |  Hewlett-Packard Limited registered Office:
Research Engineer |  Cain Road,
HP Labs           |  Bracknell,
Bristol, UK       |  Berks
+44 117 312 7066  |  RG12 1HN.
[EMAIL PROTECTED] |  Registered No: 690597 England
