Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

On Tue, 2007-05-29 at 14:58 +0100, James Rutherford wrote:
> Using UUIDs (as suggested earlier) would *work*, but would produce horrid URLs.

Note that I never suggested using UUIDs as part of a URL. What I said is that UUIDs would give you a robust scheme of internal unique identifiers, and with that in place, the use of every other identifier scheme is reduced to a matter of how you map to and from the UUIDs.

We could easily have an out-of-the-box mapping scheme to non-persistent 'friendly' identifiers if the concern is simply to have cleaner URLs. But even if UUIDs were exposed in the URLs (in a default installation), is that necessarily a problem? The ugliness of it would at least encourage people to think about the issues of id persistence / assignment in relation to that repository.

By assigning UUIDs as the primary / internal id of all persistent objects in DSpace, we can use tried and tested, well understood algorithms to generate IDs that are virtually guaranteed to be unique, which would open up potential usage / installation scenarios that could otherwise be impractical. It would also have some consistency with the JCR specification, and you've got the potential to make them public, persistent identifiers if that is deemed suitable for a given installation.

G

___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
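A minimal sketch of the mapping idea in the message above, assuming an in-memory registry. The class `IdentifierRegistry` and its method names are hypothetical and are not part of DSpace; the point is only that a UUID serves as the internal key while Handles, DOIs or 'friendly' identifiers are lookups onto it.

```python
import uuid


class IdentifierRegistry:
    """Hypothetical registry: a UUID is the internal key, and every
    external identifier (scheme, value) is just a mapping onto it."""

    def __init__(self):
        self._by_external = {}  # (scheme, value) -> UUID
        self._by_uuid = {}      # UUID -> set of (scheme, value)

    def create_object(self):
        # uuid4() generates a random UUID; collisions are vanishingly unlikely.
        object_id = uuid.uuid4()
        self._by_uuid[object_id] = set()
        return object_id

    def assign(self, object_id, scheme, value):
        if object_id not in self._by_uuid:
            raise KeyError(f"unknown object {object_id}")
        key = (scheme, value)
        owner = self._by_external.get(key)
        if owner is not None and owner != object_id:
            raise ValueError(f"{scheme}:{value} is already assigned")
        self._by_external[key] = object_id
        self._by_uuid[object_id].add(key)

    def resolve(self, scheme, value):
        # Map an external identifier back to the internal UUID.
        return self._by_external[(scheme, value)]

    def identifiers_for(self, object_id):
        # All external identifiers currently mapped to this object.
        return sorted(self._by_uuid[object_id])
```

With this shape, adding another identifier scheme means adding rows to the mapping rather than changing how objects are keyed internally.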
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

On Tue, 2007-05-29 at 10:44 +0100, James Rutherford wrote:
>> 3) Including special characters in the URL string doesn't seem like a good idea. While they are valid characters, it does take extra processing to encode/decode them from layer to layer.
>
> As I mention on the wiki, my current idea is to have URLs of the form:
>
> http://dspace.me.ac.uk/uri/hdl:1234/56
>
> which will resolve to the object with Handle 1234/56, etc. If the object also has a DOI with value 7890/12 then the following URL would point to the object as well:
>
> http://dspace.me.ac.uk/uri/doi:7890/12
>
> It is necessary to include the hdl: and doi: parts so we can distinguish between different persistent identifier mechanisms. The values allowed for the persistent identifier are dependent on the mechanism we are dealing with, and as far as possible this will be kept simple.

Whilst it is necessary to identify the persistent id scheme, that doesn't mean that using a colon as part of the identifier is necessary or desirable. Colons - or other 'unusual' characters - will end up causing problems.

In fact, I don't even see that there is a reason to include 'uri' in the URL. Why not just support the existing:

http://dspace.me.ac.uk/handle/1234/56

for handles, and:

http://dspace.me.ac.uk/doi/7890/12

for DOIs, etc.?

G
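As a rough illustration of the path-based alternative proposed above, the sketch below splits a URL such as /handle/1234/56 into a scheme and an identifier value. The function name `parse_identifier_url` and the `PATH_PREFIX_TO_SCHEME` table are assumptions made for this example, and it assumes the application is served from the root of the host with no context path.

```python
from urllib.parse import urlsplit

# Hypothetical table: first path segment -> identifier scheme.
PATH_PREFIX_TO_SCHEME = {"handle": "hdl", "doi": "doi"}


def parse_identifier_url(url):
    """Return (scheme, value) for URLs shaped like /handle/1234/56."""
    path = urlsplit(url).path
    # Split off only the first segment; the rest is the identifier value,
    # which may itself contain a slash between prefix and suffix.
    parts = path.lstrip("/").split("/", 1)
    if len(parts) != 2 or not parts[1]:
        raise ValueError(f"no identifier in path: {path!r}")
    prefix, value = parts
    if prefix not in PATH_PREFIX_TO_SCHEME:
        raise ValueError(f"unknown identifier scheme: {prefix!r}")
    return PATH_PREFIX_TO_SCHEME[prefix], value
```

Because the scheme travels as its own path segment, no colon appears in the URL and nothing has to be percent-encoded for the two example identifiers.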
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

On Tue, May 29, 2007 at 11:12:13AM +0100, Graham Triggs wrote:
> On Tue, 2007-05-29 at 10:44 +0100, James Rutherford wrote:
>>> 3) Including special characters in the URL string doesn't seem like a good idea. While they are valid characters, it does take extra processing to encode/decode them from layer to layer.
>>
>> As I mention on the wiki, my current idea is to have URLs of the form:
>>
>> http://dspace.me.ac.uk/uri/hdl:1234/56
>>
>> which will resolve to the object with Handle 1234/56, etc. If the object also has a DOI with value 7890/12 then the following URL would point to the object as well:
>>
>> http://dspace.me.ac.uk/uri/doi:7890/12
>>
>> It is necessary to include the hdl: and doi: parts so we can distinguish between different persistent identifier mechanisms. The values allowed for the persistent identifier are dependent on the mechanism we are dealing with, and as far as possible this will be kept simple.
>
> Whilst it is necessary to identify the persistent id scheme, that doesn't mean that using a colon as part of the identifier is necessary or desirable. Colons - or other 'unusual' characters - will end up causing problems.

I don't see what's so unusual or undesirable about colons. The reasoning behind doing it this way was so that the value after /uri/ is the canonical form of the identifier.

> In fact, I don't even see that there is a reason to include 'uri' in the URL. Why not just support the existing:
>
> http://dspace.me.ac.uk/handle/1234/56
>
> for handles, and:
>
> http://dspace.me.ac.uk/doi/7890/12
>
> for DOIs, etc.?

This is certainly an option.

Jim

--
James Rutherford, Research Engineer, HP Labs, Bristol, UK
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

On Tue, 2007-05-29 at 11:43 +0100, James Rutherford wrote:
> I don't see what's so unusual or undesirable about colons. The reasoning behind doing it this way was so that the value after /uri/ is the canonical form of the identifier.

The colon is a reserved character, and in this example would have to be encoded to be strictly valid according to the specifications - which would then mean it isn't the canonical form.

Not encoding the colon will have the potential to cause problems with proxies, firewalls, etc.

G
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

On Tue, May 29, 2007 at 11:56:58AM +0100, Graham Triggs wrote:
> On Tue, 2007-05-29 at 11:43 +0100, James Rutherford wrote:
>> I don't see what's so unusual or undesirable about colons. The reasoning behind doing it this way was so that the value after /uri/ is the canonical form of the identifier.
>
> The colon is a reserved character, and in this example would have to be encoded to be strictly valid according to the specifications - which would then mean it isn't the canonical form.

Well, if we're going to be strict, we should escape the value of the handle 1234/56 as 1234%2F56. Since DSpace already breaks this rule, I didn't deem including a colon as such a great crime ;)

Of course, it would be better if we could use an identifier scheme that didn't require escaped characters, but most will at least have a / to separate prefix from suffix. If we're going to be strict, I think I'd favour the following form:

http://dspace.me.ac.uk/uri/hdl%3A1234%2F56

or maybe

http://dspace.me.ac.uk/uri/hdl/1234%2F56

Jim

--
James Rutherford, Research Engineer, HP Labs, Bristol, UK
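The two strict forms above can be reproduced with Python's standard `urllib.parse.quote`, as a small sketch. The helper names and the `BASE` constant are made up for this example. For context, RFC 3986 does allow a literal colon inside a path segment (except in the first segment of a relative reference), so encoding it shows the strict form proposed in the thread rather than a hard requirement. The slash is different: it delimits path segments, so it must be encoded if the whole handle is to travel as one segment.

```python
from urllib.parse import quote, unquote

BASE = "http://dspace.me.ac.uk/uri/"  # example host used in the thread


def encode_single_segment(identifier):
    # safe="" makes quote() escape "/" as well; ":" is escaped by default.
    return BASE + quote(identifier, safe="")


def encode_scheme_segment(scheme, value):
    # Scheme as its own path segment, value percent-encoded after it.
    return BASE + scheme + "/" + quote(value, safe="")


def decode_single_segment(url):
    # Inverse of encode_single_segment: recover the canonical identifier.
    if not url.startswith(BASE):
        raise ValueError(f"unexpected URL: {url!r}")
    return unquote(url[len(BASE):])
```

Decoding the first form gives back `hdl:1234/56`, so the canonical identifier survives the round trip even though it is not visible verbatim in the URL.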
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

Hey Folks,

PI resolvers come in all shapes and sizes. What you're all talking about implementing is proxy/resolution. I would highly recommend NOT conflating the PI resolution mechanism (and why do we even have to have one?) with the URL path under which a Community, Collection, Item or Bitstream is referenced in DSpace.

What this means is that you do not have a URL in which you have to worry about the identifier being properly escaped. You also only have to be concerned with resolving one path to the Item for any PI system, i.e.

hdl:1234/5 --> http://dspace.me.ac.uk/item/ABCD

and also

doi:6789/0 --> http://dspace.me.ac.uk/item/ABCD

Don't conflate local and global identification.

Cheers,
Mark Diggory

On May 29, 2007, at 7:52 AM, James Rutherford wrote:
> On Tue, May 29, 2007 at 11:56:58AM +0100, Graham Triggs wrote:
>> On Tue, 2007-05-29 at 11:43 +0100, James Rutherford wrote:
>>> I don't see what's so unusual or undesirable about colons. The reasoning behind doing it this way was so that the value after /uri/ is the canonical form of the identifier.
>>
>> The colon is a reserved character, and in this example would have to be encoded to be strictly valid according to the specifications - which would then mean it isn't the canonical form.
>
> Well, if we're going to be strict, we should escape the value of the handle 1234/56 as 1234%2F56. Since DSpace already breaks this rule, I didn't deem including a colon as such a great crime ;)
>
> Of course, it would be better if we could use an identifier scheme that didn't require escaped characters, but most will at least have a / to separate prefix from suffix. If we're going to be strict, I think I'd favour the following form:
>
> http://dspace.me.ac.uk/uri/hdl%3A1234%2F56
>
> or maybe
>
> http://dspace.me.ac.uk/uri/hdl/1234%2F56
>
> Jim

~
Mark R. Diggory - DSpace Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology
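The separation described in the message above might be sketched as follows: global identifiers are looked up in a resolver table, and every hit leads to the same local item URL. The table contents reuse the examples from the message, while `PI_TO_LOCAL_ID`, `LOCAL_ITEM_BASE` and `resolve_to_local_url` are hypothetical names, not an existing DSpace API.

```python
# Example base URL taken from the message; the local id "ABCD" is opaque.
LOCAL_ITEM_BASE = "http://dspace.me.ac.uk/item/"

# Hypothetical resolver table: global persistent identifier -> local id.
PI_TO_LOCAL_ID = {
    "hdl:1234/5": "ABCD",
    "doi:6789/0": "ABCD",
}


def resolve_to_local_url(persistent_id):
    """Return the single local URL for a global identifier, or None."""
    local_id = PI_TO_LOCAL_ID.get(persistent_id)
    if local_id is None:
        return None
    # The persistent identifier never appears in the local URL, so no
    # escaping of ":" or "/" is needed at this layer.
    return LOCAL_ITEM_BASE + local_id
```

A resolver front end would typically answer with an HTTP redirect to the returned URL, so the global identifier and the local path can change independently of each other.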
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

On Tue, 2007-05-29 at 12:52 +0100, James Rutherford wrote:
> Well, if we're going to be strict, we should escape the value of the handle 1234/56 as 1234%2F56. Since DSpace already breaks this rule, I didn't deem including a colon as such a great crime ;)

Fair point, and you are probably right. But there is strict and there is strict... and it isn't entirely clear that the handle should be treated as a complete unit rather than the separation of prefix and suffix - globally, that's how they need to be referred to, but then we're discussing local URLs here ;-)

Yes, an unescaped slash isn't going to do anything harmful. An unescaped colon in the middle of the URL could easily trigger URL parsing bugs and security problems.

G
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

On Tue, May 29, 2007 at 08:21:47AM -0400, Mark Diggory wrote:
> PI resolvers come in all shapes and sizes. What you're all talking about implementing is proxy/resolution. I would highly recommend NOT conflating the PI resolution mechanism (and why do we even have to have one?) with the URL path under which a Community, Collection, Item or Bitstream is referenced in DSpace.
>
> What this means is that you do not have a URL in which you have to worry about the identifier being properly escaped. You also only have to be concerned with resolving one path to the Item for any PI system, i.e.
>
> hdl:1234/5 --> http://dspace.me.ac.uk/item/ABCD
>
> and also
>
> doi:6789/0 --> http://dspace.me.ac.uk/item/ABCD

OK, this is fine, but we'll need to define the form that we want for the URL. If we don't use the canonical form of persistent identifiers for this, then we'll need to use another identifier that is unique across the site (presumably, something based on the database id of the object). Using UUIDs (as suggested earlier) would *work*, but would produce horrid URLs.

cheers,

Jim

--
James Rutherford, Research Engineer, HP Labs, Bristol, UK
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

Hi,

> 1) Why would an institution use more than one PI system? How do you determine which PI system generates a PId (base it on collection, community)?

There are a lot of theoretical reasons why multiple PI schemes may be in use. Even in the simple case of an institute / repository defining a single PI scheme that it always uses for the contents of the repository, depending on what content is being added, there may already be other PIs associated with an item that is being deposited (for example, a published article may have a DOI).

Beyond that, you may have repositories that have mandated different PI schemes being merged, and therefore all those existing PIs need to be supported, as well as new ones possibly having to be assigned for the final repository.

And with all the issues surrounding 'ownership' and encouraging the use of the repository, it may well prove necessary to support (and mandate) different PI schemes on a community or collection level.

> 2) It is mentioned that HTTP isn't persistent: Could someone explain why HTTP isn't as persistent as any other protocol?

Forget to pay your domain registration fee on time and see how persistent it is ;-)

Potentially more problematic, what happens when part (or all) of a repository is migrated into another? Can the domain be transferred to the 'new' location? If not, can URL forwarding be set up on the old URLs?

HTTP can provide a unique identifier for an object at a given point in time, but it isn't necessarily going to be possible to rely on it always resolving to the same object over its entire lifetime.

> 3) Including special characters in the URL string doesn't seem like a good idea. While they are valid characters, it does take extra processing to encode/decode them from layer to layer.

Totally agreed - having colons, etc. in the URL is going to lead to problems in some circumstances.

> 4) Assigning bitstreams persistent identifiers seems dangerous. At the very least, version control and a history function are required by the application and PI system to determine if the PId is actually pointing to what was requested. Also, how are multiple bitstreams handled when assigned to an item? Does each bitstream get a PId? How does a user look at all bitstreams associated together by the item when the PId references only a single bitstream?

We had a fair amount of discussion about these issues during the architectural review last year, largely centered around extensions to the existing mechanism in order to reference a specific (or simply the latest) version of a bitstream relative to the item.

Whether there is a need to assign an 'actual' PI to individual bitstreams or not is very much a policy decision of the repository. Assigning a PI to an individual bitstream does not mean that it happens in lieu of assigning one to the item itself - so if you want to look at other bitstreams associated with the same item, you should use the item PI (and if a user has only been given a PI for a specific bitstream, then they could potentially search for the item that refers to the bitstream identified by that PI).

As for versioning, again it's a bit of a policy decision, but a PI could be assigned to a specific revision (and therefore a new revision would get a new PI). You could also have a 'special' PI that would always refer to the latest revision.

> As far as having a default PI system out of the box for DSpace, I would recommend using a local identifier schema which used the existing URLs. Include the Handle PI system in the release as a configurable option, but not turned on by default. This would remove the fake handle being assigned to all objects and clean up the default URLs out of the box.

Well, now to be controversial. IMHO, too much importance is being focused on PIs. Yes, PIs are important for preservation, but that doesn't mean that they have to be treated as something specific and central to DSpace.

PIs are 'just' metadata, and supporting multiple ways to resolve a piece (or a combination of pieces) of metadata to an asset - or simply presenting them in display - isn't really that hard.

Now there are special concerns about the handling - ensuring its presence, automatic generation/assignment, ensuring uniqueness (probably) - but that's all just a question of providing better workflows and metadata handling.

In other words, any concerns that we have about how we handle persistent identifiers could be applicable to any piece (or combination) of metadata - and by that token, solving those issues for all metadata would resolve the issues for PIs, just by treating them as 'only' metadata.

This would mean that the only id we need to centrally worry about assigning to an asset is a unique id to be resolvable within the repository - i.e. a UUID, which would likely be unique across all DSpace instances, and as such could be maintained across migrating from one
Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughts please

On May 25, 2007, at 6:35 PM, Graham Triggs wrote:
> Hi,
>
>> 1) Why would an institution use more than one PI system? How do you determine which PI system generates a PId (base it on collection, community)?
>
> There are a lot of theoretical reasons why multiple PI schemes may be in use. Even in the simple case of an institute / repository defining a single PI scheme that it always uses for the contents of the repository, depending on what content is being added, there may already be other PIs associated with an item that is being deposited (for example, a published article may have a DOI).
>
> Beyond that, you may have repositories that have mandated different PI schemes being merged, and therefore all those existing PIs need to be supported, as well as new ones possibly having to be assigned for the final repository.
>
> And with all the issues surrounding 'ownership' and encouraging the use of the repository, it may well prove necessary to support (and mandate) different PI schemes on a community or collection level.
>
>> 2) It is mentioned that HTTP isn't persistent: Could someone explain why HTTP isn't as persistent as any other protocol?
>
> Forget to pay your domain registration fee on time and see how persistent it is ;-)
>
> Potentially more problematic, what happens when part (or all) of a repository is migrated into another? Can the domain be transferred to the 'new' location? If not, can URL forwarding be set up on the old URLs?
>
> HTTP can provide a unique identifier for an object at a given point in time, but it isn't necessarily going to be possible to rely on it always resolving to the same object over its entire lifetime.

But that's like comparing apples to apple pickers. Forget resolution: an HTTP URL is just as much a URI as a Handle or DOI is. If CNRI's global registration and resolving proxy service disappears, what becomes of all the existing handles? Yes, they could possibly be considered persistent, but they would be worth little more than unresolvable strings until a comparable resolution system is reestablished.

>> 3) Including special characters in the URL string doesn't seem like a good idea. While they are valid characters, it does take extra processing to encode/decode them from layer to layer.
>
> Totally agreed - having colons, etc. in the URL is going to lead to problems in some circumstances.

Agreed. For DSpace identifiers, keep them simple for maximal portability into other naming systems.

>> 4) Assigning bitstreams persistent identifiers seems dangerous. At the very least, version control and a history function are required by the application and PI system to determine if the PId is actually pointing to what was requested. Also, how are multiple bitstreams handled when assigned to an item? Does each bitstream get a PId? How does a user look at all bitstreams associated together by the item when the PId references only a single bitstream?
>
> We had a fair amount of discussion about these issues during the architectural review last year, largely centered around extensions to the existing mechanism in order to reference a specific (or simply the latest) version of a bitstream relative to the item.
>
> Whether there is a need to assign an 'actual' PI to individual bitstreams or not is very much a policy decision of the repository. Assigning a PI to an individual bitstream does not mean that it happens in lieu of assigning one to the item itself - so if you want to look at other bitstreams associated with the same item, you should use the item PI (and if a user has only been given a PI for a specific bitstream, then they could potentially search for the item that refers to the bitstream identified by that PI).
>
> As for versioning, again it's a bit of a policy decision, but a PI could be assigned to a specific revision (and therefore a new revision would get a new PI). You could also have a 'special' PI that would always refer to the latest revision.

That works as long as any PI, or the bitstream part of an Item PI, is controllable and reassignable. As an example of what not to do: do not take the current sequence id and tack it onto the Item id, because then a replacement bitstream (after an ingest error, or for other policy reasons) cannot have the appropriate identifier remapped to it. In DSpace, a sequence id can only be assigned to one bitstream; removing that bitstream and adding another results in a new sequence id. (But actually, this is mostly moot once versioning of Items is introduced.)

>> As far as having a default PI system out of the box for DSpace, I would recommend using a local identifier schema which used the existing URLs. Include the Handle PI system in the release as a configurable option, but not turned on by default. This would remove the fake handle being assigned to all objects and clean up the default URLs out of