Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please
On 31/05/07, Larry Stone <[EMAIL PROTECTED]> wrote:
> ...hmm, it didn't _used_ to care what the path was at all, it would
> retrieve the bitstream referenced by the Sequence ID. Now, at
> least on the 1.4.1 system I checked, both SID and path have to match.
>
> But it doesn't have to be implemented that way. Since Sequence IDs
> are the ONLY Bitstream metadata which must be unique within an Item,
> the servlet might as well just ignore the path.

I made this change because ignoring the path was leading to a lot of infinite URL spaces, e.g. because of uploaded PDFs that contained relative links. It also meant that URLs with typos could return a file that failed to work as expected on some platforms, e.g. xxx/bitstream/12.34/56/1/photo.jpge would retrieve the file, but because of the '.jpge' the platform wouldn't know what to do with it, confusing the user.

A possible alternative is to 302 to the correct path, or a smarter 404 page which contains a link to the correct path and the containing item.

Either way that URL is only 'semi-persistent' -- in 1.0 (and for MIT) we decided not to give Handles to bitstreams (and that URL contains the handle of the item so we can gracefully degrade to the item itself if the bitstream has gone), though that should be a matter for local policy.

Rob

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
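The 302 alternative mentioned above can be sketched roughly as follows. This is a minimal illustration in Python, not DSpace's servlet code: the `ITEMS` store, the `Bitstream` class and `resolve_bitstream_url` are hypothetical names, and the sketch assumes a two-part Handle (prefix/suffix) followed by the Sequence ID and the filename.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Bitstream:
    sequence_id: int
    filename: str


# Hypothetical in-memory store: item Handle -> {Sequence ID -> Bitstream}.
ITEMS: Dict[str, Dict[int, Bitstream]] = {
    "1721.1/35700": {2: Bitstream(2, "60504128-MIT.pdf")},
}


def resolve_bitstream_url(path: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, redirect target or None) for a /bitstream/ path."""
    parts = path.strip("/").split("/")
    # Assumed layout: bitstream / Handle prefix / Handle suffix / Sequence ID / filename
    if len(parts) < 5 or parts[0] != "bitstream":
        return 404, None
    handle = f"{parts[1]}/{parts[2]}"
    filename = "/".join(parts[4:])  # everything after the Sequence ID
    try:
        sequence_id = int(parts[3])
    except ValueError:
        return 404, None
    bitstreams = ITEMS.get(handle)
    if bitstreams is None:
        return 404, None
    bitstream = bitstreams.get(sequence_id)
    if bitstream is None:
        # Bitstream gone: degrade gracefully to the containing item.
        return 302, f"/handle/{handle}"
    if filename != bitstream.filename:
        # Sequence ID matches but the path does not: redirect to the one correct URL.
        return 302, f"/bitstream/{handle}/{sequence_id}/{bitstream.filename}"
    return 200, None
```

With this shape, a mistyped filename or a relative link followed out of a PDF ends up at the single canonical URL instead of creating a new one, and a missing bitstream falls back to the item page.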
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please
> On Wed, May 30, 2007 at 02:01:50AM -0400, Larry Stone wrote:
> > How about the word "resource" to introduce the URI, since it is, after
> > all, a reference to a resource -- the "R" in URI. It'd be:
> >
> >   /resource/ e.g.
> >
> >   http://dspace.me.ac.uk/resource/hdl/1234/56
> >
> > This follows the proposal to encode the URI by tearing off the scheme
> > and putting it in a separate pathname element to avoid issues over
> > quoting the ":". Note that I propose using the actual scheme label
> > in the URL rather than a user-friendly label, e.g. "hdl" rather than
> > "handle".
>
> This sounds like some reasonable middle-ground. The only issue I can see
> here is that this mechanism only allows us to refer to objects that have
> persistent identifiers. Of course, we could still use an "internal" form
> of identifier for objects without actual persistent identifiers, but
> then if we have an internal format, should we not use that everywhere?
> Aside from consistency, Mark made the observation that including the
> persistent identifier in the URL is, to a certain extent, bogus. Perhaps
> we could just provide the ability to resolve URLs of the above form, but
> for making links, etc, we use an internal identifier format.

That's a good point -- DSpace is taking on the function of resolving persistent identifiers like Handles and DOIs when there is no need, since Handles, at least, already have a Web proxy server.

I wasn't counting on the "add our own flavor of PIDs to DSpace" question getting resolved favorably. DSpace _does_ have to allow data model objects to be referenced (through Web interfaces) by a URL that includes a _persistent_ identifier (as opposed to, say, a database-ID). That's the URL that will get used in links and citations despite our best efforts to promote Handles, so it needs to be reasonably permanent.

Given a DSpace-specific persistent identifier (e.g. the UUID scheme), I see two options:

 1. Give every content-model object a DSpace-type PID, no matter what. External references are URLs including the DSpace PID. Other PID schemes (e.g. Handle) resolve to those URLs. Allow plugins to register other PIDs when an object is created.

 2. Make the "DSpace" PID into a PersistentIdentifier plugin so it is a peer with the Handle or DOI plugins. The administrator chooses to support one or more, and the canonical external reference to an object becomes whichever kind of PID is configured to be canonical.

Choice (1) is simpler and seems more sensible, but (2) could be completely backward-compatible.

Note that some ingested objects will already have PIDs, e.g. if they are AIPs being re-ingested to reconstruct an archive after catastrophic failure, or DIPs (AIPs) mirrored from another repository. If _all_ DSpaces have the same PID scheme as in (1), there's no problem ingesting and accessing another archive's objects. Under (2), you could end up generating new PIDs for old objects because your archive doesn't understand the kind of PID they already have.

I think the UUID scheme (or something like it) makes a whole lot of sense, but it is a rather significant change.

> > Re special characters and quoting: I agree with James' original point that
> > the HTTP URL spec has quoting rules for just this reason, but from a
> > practical point of view, the client and server implementations have a lot
> > of bugs in this area. That's what I discovered implementing WebDAV for
> > the LNI: it wasn't worth trying to encode a slash (/) in a URL, e.g.
> > within a Handle, because it would just get stomped on differently by
> > the different clients. Better to let it get used literally as a
> > path element separator and make the servlet clever enough to figure it out.
> > Also, construct the servlet's URL so the whole path after a certain
> > point is part of the object URI, e.g. the Handle.
>
> Again, this sounds fine. The only reason this doesn't work with the
> current implementation with Handles is for referencing bitstreams -- we
> are forced to make assumptions about the structure of the persistent
> identifiers because we use the (arbitrary and unpredictable) filename as
> part of the URL. This must be avoided, whichever scheme we eventually
> use.

Do you mean the way Bitstreams are referenced in a "/bitstream/" servlet URL? I thought the path actually doesn't matter there -- it can be anything, the servlet only looks at the sequence ID, because the URL follows the pattern:

  /bitstream/// e.g.

  http://dspace.mit.edu/bitstream/1721.1/35700/2/60504128-MIT.pdf

...hmm, it didn't _used_ to care what the path was at all, it would retrieve the bitstream referenced by the Sequence ID. Now, at least on the 1.4.1 system I checked, both SID and path have to match.

But it doesn't have to be implemented that way. Since Sequence IDs are the ONLY Bitstream metadata which must be unique within an Item, the servlet might as well just ignore the path.

(Of course, this ignores the necessi
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please
On Wed, May 30, 2007 at 02:01:50AM -0400, Larry Stone wrote:
> How about the word "resource" to introduce the URI, since it is, after
> all, a reference to a resource -- the "R" in URI. It'd be:
>
>   /resource/ e.g.
>
>   http://dspace.me.ac.uk/resource/hdl/1234/56
>
> This follows the proposal to encode the URI by tearing off the scheme
> and putting it in a separate pathname element to avoid issues over
> quoting the ":". Note that I propose using the actual scheme label
> in the URL rather than a user-friendly label, e.g. "hdl" rather than
> "handle".

This sounds like some reasonable middle-ground. The only issue I can see here is that this mechanism only allows us to refer to objects that have persistent identifiers. Of course, we could still use an "internal" form of identifier for objects without actual persistent identifiers, but then if we have an internal format, should we not use that everywhere?

Aside from consistency, Mark made the observation that including the persistent identifier in the URL is, to a certain extent, bogus. Perhaps we could just provide the ability to resolve URLs of the above form, but for making links, etc, we use an internal identifier format.

> Re special characters and quoting: I agree with James' original point that
> the HTTP URL spec has quoting rules for just this reason, but from a
> practical point of view, the client and server implementations have a lot
> of bugs in this area. That's what I discovered implementing WebDAV for
> the LNI: it wasn't worth trying to encode a slash (/) in a URL, e.g.
> within a Handle, because it would just get stomped on differently by
> the different clients. Better to let it get used literally as a
> path element separator and make the servlet clever enough to figure it out.
> Also, construct the servlet's URL so the whole path after a certain
> point is part of the object URI, e.g. the Handle.

Again, this sounds fine. The only reason this doesn't work with the current implementation with Handles is for referencing bitstreams -- we are forced to make assumptions about the structure of the persistent identifiers because we use the (arbitrary and unpredictable) filename as part of the URL. This must be avoided, whichever scheme we eventually use.

cheers,

Jim

--
James Rutherford  |  Hewlett-Packard Limited registered Office:
Research Engineer |  Cain Road,
HP Labs           |  Bracknell,
Bristol, UK       |  Berks
+44 117 312 7066  |  RG12 1HN.
[EMAIL PROTECTED] |  Registered No: 690597 England

The contents of this message and any attachments to it are confidential and may be legally privileged. If you have received this message in error, you should delete it from your system immediately and advise the sender. To any recipient of this message within HP, unless otherwise stated you should consider this message and attachments as "HP CONFIDENTIAL".
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please
About the URLs to access objects, it's really a matter for each DSpace UI webapp to implement, isn't it? I think it's important to include a pathname component to introduce the encoded(*) object-URI, however, just to protect against namespace collisions -- e.g. what if the UI design had one of its own URLs starting with "doi"?

How about the word "resource" to introduce the URI, since it is, after all, a reference to a resource -- the "R" in URI. It'd be:

  /resource/ e.g.

  http://dspace.me.ac.uk/resource/hdl/1234/56

This follows the proposal to encode the URI by tearing off the scheme and putting it in a separate pathname element to avoid issues over quoting the ":". Note that I propose using the actual scheme label in the URL rather than a user-friendly label, e.g. "hdl" rather than "handle".

Re special characters and quoting: I agree with James' original point that the HTTP URL spec has quoting rules for just this reason, but from a practical point of view, the client and server implementations have a lot of bugs in this area. That's what I discovered implementing WebDAV for the LNI: it wasn't worth trying to encode a slash (/) in a URL, e.g. within a Handle, because it would just get stomped on differently by the different clients. Better to let it get used literally as a path element separator and make the servlet clever enough to figure it out. Also, construct the servlet's URL so the whole path after a certain point is part of the object URI, e.g. the Handle.

Also note that the transformation from object URI to the actionable URL is reversible -- we can pull the URI's scheme and path right out of the URL and put it back together unambiguously. I think it's essential to have one unambiguous transformation for all URIs.

> > 3) Including special characters in the URL string doesn't seem like a
> > good idea. While they are valid characters, it does take extra
> > processing to encode/decode them from layer to layer. Why not just
> > leave the URL alone or change /handle to something like /uri, /id, or
> > /pid? Why encode the PI system into the URI?
>
> As I mention on the wiki, my current idea is to have URLs of the form:
>
>   http://dspace.me.ac.uk/uri/hdl:1234/56
>
> which will resolve to the object with Handle 1234/56, etc. If the
> object also has a DOI with value 7890/12 then the following URL would
> point to the object as well:
>
>   http://dspace.me.ac.uk/uri/doi:7890/12
>
> It is necessary to include the "hdl:" and "doi:" parts so we can
> distinguish between different persistent identifier mechanisms. The
> values allowed for the persistent identifier are dependent on the
> mechanism we are dealing with, and as far as possible this will be kept
> simple.

-- Larry
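The reversible transformation described in this message can be illustrated with a short Python sketch. The function names `uri_to_path` and `path_to_uri` are made up for the illustration, and the sketch assumes the "/resource/" prefix proposed above and a scheme label that contains neither ":" nor "/".

```python
RESOURCE_PREFIX = "/resource/"  # pathname component proposed in the message above


def uri_to_path(uri: str) -> str:
    """Turn an object URI such as 'hdl:1234/56' into '/resource/hdl/1234/56'."""
    scheme, sep, rest = uri.partition(":")
    if not sep or not scheme or not rest or "/" in scheme:
        raise ValueError(f"not a scheme:path URI: {uri!r}")
    # The ":" is dropped; the scheme becomes its own path element.
    return f"{RESOURCE_PREFIX}{scheme}/{rest}"


def path_to_uri(path: str) -> str:
    """Inverse of uri_to_path: recover 'hdl:1234/56' from '/resource/hdl/1234/56'."""
    if not path.startswith(RESOURCE_PREFIX):
        raise ValueError(f"not a resource path: {path!r}")
    # First element after the prefix is the scheme; the rest is the identifier,
    # with any "/" inside it (e.g. in a Handle) left literally in place.
    scheme, sep, rest = path[len(RESOURCE_PREFIX):].partition("/")
    if not sep or not scheme or not rest:
        raise ValueError(f"missing scheme or identifier: {path!r}")
    return f"{scheme}:{rest}"
```

Because only the first path element after the prefix is treated as the scheme, slashes inside the identifier never need quoting, and the round trip is unambiguous for any scheme that follows the stated assumption.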
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please
On Fri, May 25, 2007 at 03:39:12PM -0500, Brad Teale wrote:
> How do you determine which PI system generates a PId (base it
> on collection, community)? What if one PI system fails (URL
> unreachable, temporarily down) and it is needed to resolve the PId?
> Could it be possible to create a loop of PIds that resolve to different
> PI systems while moving through the PI system stack?

Something to note is that I don't anticipate objects normally having more than one identifier. While the prototype allows this, objects will still be assigned only one identifier (according to configuration -- the details of which are still undecided), but we are now able to associate multiple identifiers with objects (the stack is there to define what we understand) and resolve them all to the correct place.

> 3) Including special characters in the URL string doesn't seem like a
> good idea. While they are valid characters, it does take extra
> processing to encode/decode them from layer to layer. Why not just
> leave the URL alone or change /handle to something like /uri, /id, or
> /pid? Why encode the PI system into the URI?

As I mention on the wiki, my current idea is to have URLs of the form:

  http://dspace.me.ac.uk/uri/hdl:1234/56

which will resolve to the object with Handle 1234/56, etc. If the object also has a DOI with value 7890/12 then the following URL would point to the object as well:

  http://dspace.me.ac.uk/uri/doi:7890/12

It is necessary to include the "hdl:" and "doi:" parts so we can distinguish between different persistent identifier mechanisms. The values allowed for the persistent identifier are dependent on the mechanism we are dealing with, and as far as possible this will be kept simple.

> As far as having a default PI system out of the box for Dspace, I would
> recommend using a local identifier schema which used the existing URLs.
> Include the Handle PI system in the release as a configurable option,
> but not turned on by default. This would remove the fake handle being
> assigned to all objects and clean up the default URLs out of the box.

I've already experimented with a "null" identifier that can be used to resolve to objects locally. For example, in my prototype, the following URL would resolve to the Item with internal id 4:

  http://dspace.me.ac.uk/uri/dsi:2/4

I'm still not convinced that this is a good idea, but it seems useful and it makes accessing individual bitstreams a little more predictable and consistent with the other objects.

cheers,

Jim
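To make the "/uri/" form concrete, here is a minimal Python sketch of how a scheme prefix could select a resolver from a configured set. It is not the prototype's code: the `RESOLVERS` table, the resolver functions and their string return values are hypothetical, and reading "dsi:2/4" as "object type 2, internal id 4" is an assumption based only on the example URL above.

```python
from typing import Callable, Dict

URI_PREFIX = "/uri/"


def resolve_handle(value: str) -> str:
    # Placeholder: a real resolver would look the Handle up in the repository.
    return f"object with Handle {value}"


def resolve_doi(value: str) -> str:
    # Placeholder for a DOI lookup.
    return f"object with DOI {value}"


def resolve_local(value: str) -> str:
    # Assumption: the "null" identifier is "<type code>/<internal id>".
    type_code, _, internal_id = value.partition("/")
    return f"local object type={int(type_code)} id={int(internal_id)}"


# Hypothetical configuration: the schemes this installation understands.
RESOLVERS: Dict[str, Callable[[str], str]] = {
    "hdl": resolve_handle,
    "doi": resolve_doi,
    "dsi": resolve_local,
}


def resolve(path: str) -> str:
    """Resolve a path such as '/uri/hdl:1234/56' via the matching resolver."""
    if not path.startswith(URI_PREFIX):
        raise ValueError(f"not a /uri/ path: {path!r}")
    scheme, sep, value = path[len(URI_PREFIX):].partition(":")
    if not sep or not value:
        raise ValueError(f"missing scheme or value: {path!r}")
    resolver = RESOLVERS.get(scheme)
    if resolver is None:
        raise LookupError(f"unsupported identifier scheme: {scheme!r}")
    return resolver(value)
```

An object that has both a Handle and a DOI is reachable through either URL, and an unknown scheme fails explicitly rather than being guessed at.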
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please
I looked through the Persistent Identifier (PI) wiki page and came up with a few questions/comments.

1) You created the prototype with a stackable interface, something I thought about doing, but now I've been wondering if it causes more problems than it's worth. Why would an institution use more than one PI system? How do you determine which PI system generates a PId (base it on collection, community)? What if one PI system fails (URL unreachable, temporarily down) and it is needed to resolve the PId? Could it be possible to create a loop of PIds that resolve to different PI systems while moving through the PI system stack?

2) It is mentioned that HTTP isn't "persistent": Could someone explain why HTTP isn't as persistent as any other protocol?

3) Including special characters in the URL string doesn't seem like a good idea. While they are valid characters, it does take extra processing to encode/decode them from layer to layer. Why not just leave the URL alone or change /handle to something like /uri, /id, or /pid? Why encode the PI system into the URI?

4) Assigning bitstreams persistent identifiers seems dangerous. At the very least, version control and a history function are required by the application and PI system to determine if the PId is actually pointing to what was requested. Also, how are multiple bitstreams handled when assigned to an item? Does each bitstream get a PId? How does a user look at all bitstreams associated together by the item when the PId references only a single bitstream?

As far as having a default PI system out of the box for Dspace, I would recommend using a local identifier schema which used the existing URLs. Include the Handle PI system in the release as a configurable option, but not turned on by default. This would remove the fake handle being assigned to all objects and clean up the default URLs out of the box.

-- Brad

On 05/22/2007 05:06 AM, James Rutherford wrote:
> Hi all,
>
> I've recently started looking into the way DSpace deals (or doesn't)
> with persistent identifiers (prompted in part by patch #1690912 and a
> conversation I had with Mark Diggory). I've put some thoughts on the
> wiki:
>
> http://wiki.dspace.org/index.php/PersistentIdentifiers
>
> and I'd like to gather some input. I've already implemented everything
> discussed on the wiki in a prototype, and it seems to be working well.
> Note that the implementation is being done in parallel with the DAO
> prototype:
>
> http://wiki.dspace.org/index.php/DaoPrototype
>
> The most controversial aspects that I've come up against are:
>
> * deciding which persistent identifier method is used (if more than one
>   is supported); and
> * what the URLs should look like (http://dspace.me.ac.uk/uri/hdl:12/34
>   rather than http://dspace.me.ac.uk/handle/12/34, for instance)
>
> I'm particularly interested in hearing from folks who already need to
> support other identifiers (PURLs, DOIs, etc), but any input would be
> appreciated.
>
> cheers,
>
> Jim
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please
On Tue, May 22, 2007 at 02:15:48PM -0700, Han, Yan wrote:
> The wiki mentions that DOI is using http, which is not totally correct.

I know this. The list of persistent identifier mechanisms was only supposed to be examples of what we could use. I don't intend to actually build support for DOIs or any other mechanism other than Handles into DSpace, rather my goal is to make it extremely simple for others to do so where necessary.

The point of my email wasn't to find out which persistent identifier mechanism DSpace should use by default, it was to gather opinion on how we can make DSpace less dependent on one mechanism in a way that isn't limiting.

cheers,

Jim
[Dspace-tech] Persistent identifiers in DSpace -- thoughts please
Hi all,

I've recently started looking into the way DSpace deals (or doesn't) with persistent identifiers (prompted in part by patch #1690912 and a conversation I had with Mark Diggory). I've put some thoughts on the wiki:

http://wiki.dspace.org/index.php/PersistentIdentifiers

and I'd like to gather some input. I've already implemented everything discussed on the wiki in a prototype, and it seems to be working well. Note that the implementation is being done in parallel with the DAO prototype:

http://wiki.dspace.org/index.php/DaoPrototype

The most controversial aspects that I've come up against are:

 * deciding which persistent identifier method is used (if more than one is supported); and
 * what the URLs should look like (http://dspace.me.ac.uk/uri/hdl:12/34 rather than http://dspace.me.ac.uk/handle/12/34, for instance)

I'm particularly interested in hearing from folks who already need to support other identifiers (PURLs, DOIs, etc), but any input would be appreciated.

cheers,

Jim