Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughtsplease

2007-05-30 Thread Graham Triggs
On Tue, 2007-05-29 at 14:58 +0100, James Rutherford wrote:
 Using UUIDs (as suggested earlier) would *work*, but would produce
 horrid URLs.

Note that I never suggested using UUIDs as part of a URL. What I said is
that UUIDs would give you a robust scheme of internal unique identifiers
- and in having that, the use of all other identifier schemes is
reduced simply to a matter of how you map to/from the UUIDs.

We could easily have an out-of-the-box mapping scheme to non-persistent
'friendly' identifiers if the concern is simply to have cleaner URLs.

But even if UUIDs were exposed in the URLs (in a default installation),
is that necessarily a problem? The ugliness of it would at least
encourage people to think about the issues of id persistence /
assignment in relation to that repository.

By assigning UUIDs as the primary / internal id of all persistent
objects in DSpace, we can use tried and tested, well understood
algorithms to generate IDs that are virtually guaranteed to be unique,
which would open up potential usage / installation scenarios that could
otherwise be impractical. It would also have some consistency with the
JCR specification, and you've got the potential to make them public,
persistent identifiers if that is deemed suitable for a given
installation.
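
As a rough illustration of the mapping idea above, the sketch below uses a UUID as the internal key for each object and treats Handles, DOIs and any other scheme as lookups onto that key. The names `IdentifierRegistry`, `create_object`, `register`, `resolve` and `identifiers_for` are hypothetical and are not part of DSpace; it is a sketch of the approach, not a proposed implementation.

```python
import uuid


class IdentifierRegistry:
    """Hypothetical sketch: UUIDs are the internal ids; other schemes map onto them."""

    def __init__(self):
        # (scheme, value) -> internal UUID, e.g. ("hdl", "1234/56") -> UUID(...)
        self._external = {}

    def create_object(self):
        # uuid4() is random-based; a collision is possible in theory
        # but vanishingly unlikely in practice.
        return uuid.uuid4()

    def register(self, scheme, value, internal_id):
        key = (scheme, value)
        existing = self._external.get(key)
        if existing is not None and existing != internal_id:
            raise ValueError(f"{scheme}:{value} is already assigned to another object")
        self._external[key] = internal_id

    def resolve(self, scheme, value):
        # Returns None when the external identifier is unknown.
        return self._external.get((scheme, value))

    def identifiers_for(self, internal_id):
        # All external identifiers currently mapped to one internal id.
        return sorted(k for k, v in self._external.items() if v == internal_id)


registry = IdentifierRegistry()
item = registry.create_object()
registry.register("hdl", "1234/56", item)
registry.register("doi", "7890/12", item)
```

Here both `registry.resolve("hdl", "1234/56")` and `registry.resolve("doi", "7890/12")` return the same internal UUID, so adding another scheme only means adding another mapping.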

G
This email has been scanned by Postini.
For more information please visit http://www.postini.com


-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech


Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughtsplease

2007-05-29 Thread Graham Triggs
On Tue, 2007-05-29 at 10:44 +0100, James Rutherford wrote:
  3) Including special characters in the URL string doesn't seem like a
  good idea.  While they are valid characters, it does take extra
  processing to encode/decode them from layer to layer.
 
 As I mention on the wiki, my current idea is to have URLs of the form:
 
 http://dspace.me.ac.uk/uri/hdl:1234/56
 
 which will resolve to the object with Handle 1234/56, etc. If the
 object also has a DOI with value 7890/12 then the following URL would
 point to the object as well:
 
 http://dspace.me.ac.uk/uri/doi:7890/12
 
 It is necessary to include the hdl: and doi: parts so we can
 distinguish between different persistent identifier mechanisms. The
 values allowed for the persistent identifier are dependent on the
 mechanism we are dealing with, and as far as possible this will be kept
 simple.

Whilst it is necessary to identify the persistent id scheme, that
doesn't mean that using a colon as part of the identifier is necessary
or desirable. Colons - or other 'unusual' characters - will end up
causing problems.

In fact, I don't even see that there is a reason to include 'uri' in the
url. Why not just support the existing:

http://dspace.me.ac.uk/handle/1234/56

for handles, and:

http://dspace.me.ac.uk/doi/7890/12

for DOIs, etc.?
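
A minimal sketch of how such paths could be taken apart, assuming the first path segment names the scheme and everything after it is the identifier value. The function name `split_identifier_path` and the `KNOWN_SCHEMES` set are illustrative assumptions, not existing DSpace code.

```python
# Assumed set of path prefixes; a real installation would configure this.
KNOWN_SCHEMES = {"handle", "doi"}


def split_identifier_path(path):
    """Split '/handle/1234/56' into ('handle', '1234/56')."""
    # The first segment is the scheme, the remainder is the identifier value.
    scheme, _, value = path.lstrip("/").partition("/")
    if scheme not in KNOWN_SCHEMES or not value:
        raise ValueError(f"not a recognised identifier path: {path!r}")
    return scheme, value
```

With this shape no colon is needed in the URL: `split_identifier_path("/doi/7890/12")` yields the scheme and the value from plain path segments.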

G 
 
 
This e-mail is confidential and should not be used by anyone who is not the 
original intended recipient. BioMed Central Limited does not accept liability 
for any statements made which are clearly the sender's own and not expressly 
made on behalf of BioMed Central Limited. No contracts may be concluded on 
behalf of BioMed Central Limited by means of e-mail communication. BioMed 
Central Limited Registered in England and Wales with registered number 3680030 
Registered Office Middlesex House, 34-42 Cleveland Street, London W1T 4LB



Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughtsplease

2007-05-29 Thread James Rutherford
On Tue, May 29, 2007 at 11:12:13AM +0100, Graham Triggs wrote:
 On Tue, 2007-05-29 at 10:44 +0100, James Rutherford wrote:
   3) Including special characters in the URL string doesn't seem like a
   good idea.  While they are valid characters, it does take extra
   processing to encode/decode them from layer to layer.
  
  As I mention on the wiki, my current idea is to have URLs of the form:
  
  http://dspace.me.ac.uk/uri/hdl:1234/56
  
  which will resolve to the object with Handle 1234/56, etc. If the
  object also has a DOI with value 7890/12 then the following URL would
  point to the object as well:
  
  http://dspace.me.ac.uk/uri/doi:7890/12
  
  It is necessary to include the hdl: and doi: parts so we can
  distinguish between different persistent identifier mechanisms. The
  values allowed for the persistent identifier are dependent on the
  mechanism we are dealing with, and as far as possible this will be kept
  simple.
 
 Whilst it is necessary to identify the persistent id scheme, that
 doesn't mean that using a colon as part of the identifier is necessary
 or desirable. Colons - or other 'unusual' characters - will end up
 causing problems.

I don't see what's so unusual or undesirable about colons. The reasoning
behind doing it this way was so that the value after /uri/ is the
canonical form of the identifier.

 In fact, I don't even see that there is a reason to include 'uri' in the
 url. Why not just support the existing:
 
 http://dspace.me.ac.uk/handle/1234/56
 
 for handles, and:
 
 http://dspace.me.ac.uk/doi/7890/12
 
 for DOIs, etc.?

This is certainly an option.

Jim

-- 
James Rutherford  |  Hewlett-Packard Limited registered Office:
Research Engineer |  Cain Road,
HP Labs   |  Bracknell,
Bristol, UK   |  Berks
+44 117 312 7066  |  RG12 1HN.
[EMAIL PROTECTED]   |  Registered No: 690597 England

The contents of this message and any attachments to it are confidential and
may be legally privileged. If you have received this message in error, you
should delete it from your system immediately and advise the sender. To any
recipient of this message within HP, unless otherwise stated you should
consider this message and attachments as HP CONFIDENTIAL.



Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughtsplease

2007-05-29 Thread Graham Triggs
On Tue, 2007-05-29 at 11:43 +0100, James Rutherford wrote:
 I don't see what's so unusual or undesirable about colons. The reasoning
 behind doing it this way was so that the value after /uri/ is the
 canonical form of the identifier.

The colon is a reserved character, and in this example would have to be
encoded to be strictly valid according to the specifications - which
would then mean it isn't the canonical form.

Not encoding the colon will have the potential to cause problems with
proxies, firewalls, etc.

G


Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughtsplease

2007-05-29 Thread James Rutherford
On Tue, May 29, 2007 at 11:56:58AM +0100, Graham Triggs wrote:
 On Tue, 2007-05-29 at 11:43 +0100, James Rutherford wrote:
  I don't see what's so unusual or undesirable about colons. The reasoning
  behind doing it this way was so that the value after /uri/ is the
  canonical form of the identifier.
 
 The colon is a reserved character, and in this example would have to be
 encoded to be strictly valid according to the specifications - which
 would then mean it isn't the canonical form.

Well if we're going to be strict, we should escape the value of the
handle 1234/56 as 1234%2F56. Since DSpace already breaks this rule, I
didn't deem including a colon as such a great crime ;) Of course, it
would be better if we could use an identifier scheme that didn't require
escaped characters, but most will at least have a / to separate prefix
from suffix. If we're going to be strict, I think I'd favour the
following form:

http://dspace.me.ac.uk/uri/hdl%3A1234%2F56

or maybe

http://dspace.me.ac.uk/uri/hdl/1234%2F56
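
For reference, the two strict forms above can be produced with Python's standard `urllib.parse.quote`, which percent-encodes both the colon and the slash when `safe=""` is passed. The variable names are only for illustration.

```python
from urllib.parse import quote, unquote

canonical = "hdl:1234/56"

# Whole identifier as one opaque path segment: ':' -> %3A, '/' -> %2F
encoded_whole = quote(canonical, safe="")

# Scheme as its own path segment, with only the Handle value encoded
encoded_suffix = "hdl/" + quote("1234/56", safe="")

print("http://dspace.me.ac.uk/uri/" + encoded_whole)
print("http://dspace.me.ac.uk/uri/" + encoded_suffix)

# Decoding the single-segment form gives back the canonical identifier.
assert unquote(encoded_whole) == canonical
```

One practical caveat: some servers refuse an encoded slash in the path unless configured otherwise (Apache httpd's `AllowEncodedSlashes` directive defaults to `Off`, for example), so the `%2F` forms may need server-side changes.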

Jim

-- 
James Rutherford  |  Hewlett-Packard Limited registered Office:
Research Engineer |  Cain Road,
HP Labs   |  Bracknell,
Bristol, UK   |  Berks
+44 117 312 7066  |  RG12 1HN.
[EMAIL PROTECTED]   |  Registered No: 690597 England



Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughtsplease

2007-05-29 Thread Mark Diggory
Hey Folks,

PI resolvers come in all shapes and sizes. What you're all talking
about implementing is proxy/resolution. I would highly recommend NOT
conflating the PI resolution mechanism (and why do we even have to
have one?) with the URL path under which a Community, Collection, Item
or Bitstream is referenced in DSpace. What this means is that
you do not have a URL in which you have to worry about the identifier
being properly escaped. You also only have to be concerned with
resolving one path to the Item for any PI system.

I.E.

hdl:1234/5 -> http://dspace.me.ac.uk/item/ABCD

and also

doi:6789/0 -> http://dspace.me.ac.uk/item/ABCD

Don't conflate local and global identification.
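
A small sketch of that separation, assuming a simple lookup table: the global identifiers are keys in a resolver, and each one resolves to the single local URL for the item. `RESOLVER`, `LOCAL_BASE` and `resolve` are hypothetical names, and the local id `ABCD` is just the placeholder used in the example above.

```python
# Hypothetical resolver table: many global identifiers, one local id per item.
LOCAL_BASE = "http://dspace.me.ac.uk/item/"

RESOLVER = {
    "hdl:1234/5": "ABCD",
    "doi:6789/0": "ABCD",
}


def resolve(pi):
    """Return the local URL for a persistent identifier, or None if unknown."""
    local_id = RESOLVER.get(pi)
    return None if local_id is None else LOCAL_BASE + local_id
```

The identifier only ever appears as a lookup key, so it never has to be escaped into the local URL path.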

Cheers,
Mark Diggory

On May 29, 2007, at 7:52 AM, James Rutherford wrote:

 On Tue, May 29, 2007 at 11:56:58AM +0100, Graham Triggs wrote:
 On Tue, 2007-05-29 at 11:43 +0100, James Rutherford wrote:
 I don't see what's so unusual or undesirable about colons. The reasoning
 behind doing it this way was so that the value after /uri/ is the
 canonical form of the identifier.

 The colon is a reserved character, and in this example would have to be
 encoded to be strictly valid according to the specifications - which
 would then mean it isn't the canonical form.

 Well if we're going to be strict, we should escape the value of the
 handle 1234/56 as 1234%2F56. Since DSpace already breaks this rule, I
 didn't deem including a colon as such a great crime ;) Of course, it
 would be better if we could use an identifier scheme that didn't require
 escaped characters, but most will at least have a / to separate prefix
 from suffix. If we're going to be strict, I think I'd favour the
 following form:

 http://dspace.me.ac.uk/uri/hdl%3A1234%2F56

 or maybe

 http://dspace.me.ac.uk/uri/hdl/1234%2F56

 Jim


~
Mark R. Diggory - DSpace Systems Manager
MIT Libraries, Systems and Technology Services
Massachusetts Institute of Technology





Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughtsplease

2007-05-29 Thread Graham Triggs
On Tue, 2007-05-29 at 12:52 +0100, James Rutherford wrote:
 Well if we're going to be strict, we should escape the value of the
 handle 1234/56 as 1234%2F56. Since DSpace already breaks this rule, I
 didn't deem including a colon as such a great crime ;)

Fair point, and you are probably right. But there is strict and there is
strict... and it isn't entirely clear that the handle should be treated
as a complete unit rather than the separation of prefix and suffix -
globally, that's how they need to be referred to, but then we're
discussing local urls here ;-)

Yes, an unescaped slash isn't going to do anything harmful. An unescaped
colon in the middle of the url could easily trigger url parsing bugs and
security problems.
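
One concrete case where a colon is misread can be shown with Python's standard `urllib.parse`. In an absolute URL the colon stays inside the path, but when the same identifier is used as a relative reference (for instance a relative link emitted from a page under /uri/), the text before the colon is parsed as a URI scheme. This is only one illustration of a parsing ambiguity; it does not show how any particular proxy or firewall behaves.

```python
from urllib.parse import urljoin, urlsplit

# In an absolute URL the colon is simply part of the path.
absolute = urlsplit("http://dspace.me.ac.uk/uri/hdl:1234/56")
print(absolute.path)

# As a relative reference, "hdl" is taken to be a URI scheme.
relative = urlsplit("hdl:1234/56")
print(relative.scheme, relative.path)

# Joining against a base therefore does not produce a path under /uri/.
joined = urljoin("http://dspace.me.ac.uk/uri/", "hdl:1234/56")
print(joined)
```

RFC 3986 (section 4.2) describes this case and suggests prefixing such a relative reference with "./" so that the first segment is not mistaken for a scheme.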

G


Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughtsplease

2007-05-29 Thread James Rutherford
On Tue, May 29, 2007 at 08:21:47AM -0400, Mark Diggory wrote:
 PI resolvers come in all shapes and sizes. What you're all talking
 about implementing is proxy/resolution. I would highly recommend NOT
 conflating the PI resolution mechanism (and why do we even have to
 have one?) with the URL path under which a Community, Collection, Item
 or Bitstream is referenced in DSpace. What this means is that
 you do not have a URL in which you have to worry about the identifier
 being properly escaped. You also only have to be concerned with
 resolving one path to the Item for any PI system.
 I.E.
 
 hdl:1234/5 -> http://dspace.me.ac.uk/item/ABCD
 
 and also
 
 doi:6789/0 -> http://dspace.me.ac.uk/item/ABCD

OK, this is fine, but we'll need to define the form that we want for the
URL. If we don't use the canonical form of persistent identifiers for
this, then we'll need to use another identifier that is unique across
the site (presumably, something based on the database id of the object).
Using UUIDs (as suggested earlier) would *work*, but would produce
horrid URLs.

cheers,

Jim

-- 
James Rutherford  |  Hewlett-Packard Limited registered Office:
Research Engineer |  Cain Road,
HP Labs   |  Bracknell,
Bristol, UK   |  Berks
+44 117 312 7066  |  RG12 1HN.
[EMAIL PROTECTED]   |  Registered No: 690597 England



Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughtsplease

2007-05-25 Thread Graham Triggs
Hi,

 1) Why would an institution use more than one PI
 system?  How do you determine which PI system generates a PId (base it
 on collection, community)?

There are a lot of theoretical reasons why multiple PI schemes may be in 
use. Even if you have the simple case of an institute / repository defining 
a single PI scheme that it always uses for the contents of the repository, 
depending on what content is being added, there may already be other PIs 
associated with an item that is being deposited (for example, a published 
article may have a DOI).

Beyond that, you may have repositories that have mandated different PI 
schemes being merged, and therefore all those existing PIs need to be 
supported, as well as new ones for the final repository possibly having to 
be assigned.

And with all the issues surrounding 'ownership' and encouraging the use of 
the repository, it may well prove necessary to support (and mandate) 
different PI schemes on a community or collection level.

 2)  It is mentioned that HTTP isn't persistent:  Could someone explain
 why HTTP isn't as persistent as any other protocol?

Forget to pay your domain registration fee on time and see how persistent it 
is ;-)

Potentially more problematic, what happens when part (or all) of a 
repository is migrated into another? Can the domain be transferred to the 
'new' location? If not, can URL forwarding be set up on the old URLs?

HTTP can provide a unique identifier for an object at a given point in time, 
but it isn't necessarily going to be possible to rely on it always resolving 
to the same object over its entire lifetime.

 3) Including special characters in the URL string doesn't seem like a
 good idea.  While they are valid characters, it does take extra
 processing to encode/decode them from layer to layer.

Totally agreed - having colons, etc. in the url is going to lead to problems 
in some circumstances.

 4) Assigning bitstreams persistent identifiers seems dangerous.  At the
 very least, version control and a history function are required by the
 application and PI system to determine if the PId is actually pointing
 to what was requested.  Also, how are multiple bitstreams handled when
 assigned to an item?  Does each bitstream get a PId?  How does a user
 look at all bitstreams associated together by the item when the PId
 references only a single bitstream?

We had a fair amount of discussion about these issues during the 
architectural review last year - which were largely centered around 
extensions to the existing mechanism in order to reference specific (or 
simply the latest) version of a bitstream as relative to the item.

Whether there is a need to assign an 'actual' PI to individual bitstreams or 
not is very much a policy decision of the repository. Assigning a PI to an 
individual bitstream does not mean that it happens in lieu of assigning one 
to the item itself - so if you want to look at other bitstreams associated 
to the same item, you should use the item PI (and if a user has only been 
given a PI for a specific bitstream, then they could potentially search for 
the item that refers to the bitstream identified by that PI).

As for versioning, again it's a bit of a policy decision, but a PI could be 
assigned to a specific revision (and therefore a new revision would get a 
new PI). You could also have a 'special' PI that would always refer to the 
latest revision.
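
To make the versioning option concrete, here is a minimal sketch in which every revision of a bitstream gets its own PI and one extra 'special' PI always resolves to the newest revision. The class `BitstreamHistory` and the identifier strings in the usage lines are invented for illustration; how PIs are minted and stored would remain a repository policy decision.

```python
class BitstreamHistory:
    """Hypothetical sketch: one PI per revision, plus an alias for the newest."""

    def __init__(self, latest_pi):
        self.latest_pi = latest_pi
        self._revisions = []  # (pi, content) pairs in creation order

    def add_revision(self, pi, content):
        # A revision PI is never reused, so it keeps pointing at the same bytes.
        if pi == self.latest_pi or any(pi == p for p, _ in self._revisions):
            raise ValueError(f"identifier already in use: {pi}")
        self._revisions.append((pi, content))

    def resolve(self, pi):
        # The 'special' PI follows the most recent revision.
        if pi == self.latest_pi:
            return self._revisions[-1][1] if self._revisions else None
        for p, content in self._revisions:
            if p == pi:
                return content
        return None


history = BitstreamHistory(latest_pi="example/1.latest")
history.add_revision("example/1.v1", b"first upload")
history.add_revision("example/1.v2", b"corrected upload")
```

After these calls `example/1.v1` still resolves to the first upload, while `example/1.latest` resolves to the corrected one.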

 As far as having a default PI system out of the box for Dspace, I would
 recommend using a local identifier schema which used the existing URLs.
 Include the Handle PI system in the release as a configurable option,
 but not turned on by default.  This would remove the fake handle being
 assigned to all objects and clean up the default URLs out of the box.

Well, now to be controversial. IMHO, too much importance is being focused on 
PIs. Yes, PIs are important for preservation, but that doesn't mean that 
they have to be treated as something specific and central to DSpace.

PIs are 'just' metadata, and supporting multiple ways to resolve a piece (or 
a combination of pieces) of metadata to an asset - or simply presenting 
them in display - isn't really that hard.

Now there are special concerns about the handling - ensuring its presence, 
automatic generation/assignment, ensuring uniqueness (probably) - but that's 
all just a question of providing better workflows and metadata handling. In 
other words, any concerns that we have about how we handle persistent 
identifiers could be applicable to any piece (or combination) of metadata - 
and by that token, solving those issues for all metadata would resolve the 
issues for PIs, just by treating them as 'only' metadata.

This would mean that the only id we need to centrally worry about assigning 
to an asset is a unique id to be resolvable within the repository - ie. a 
UUID, which would likely be unique across all DSpace instances, and as such 
could be maintained across migrating from one 

Re: [Dspace-tech] Persistent identifiers in DSpace -- thoughtsplease

2007-05-25 Thread Mark Diggory

On May 25, 2007, at 6:35 PM, Graham Triggs wrote:

 Hi,

 1) Why would an institution use more than one PI
 system?  How do you determine which PI system generates a PId (base it
 on collection, community)?

 There are a lot of theoretical reasons why multiple PI schemes may be in
 use. Even if you have the simple case of an institute / repository defining
 a single PI scheme that it always uses for the contents of the repository,
 depending on what content is being added, there may already be other PIs
 associated with an item that is being deposited (for example, a published
 article may have a DOI).

 Beyond that, you may have repositories that have mandated different PI
 schemes being merged, and therefore all those existing PIs need to be
 supported, as well as new ones for the final repository possibly having to
 be assigned.

 And with all the issues surrounding 'ownership' and encouraging the use of
 the repository, it may well prove necessary to support (and mandate)
 different PI schemes on a community or collection level.

 2)  It is mentioned that HTTP isn't persistent:  Could someone explain
 why HTTP isn't as persistent as any other protocol?

 Forget to pay your domain registration fee on time and see how persistent it
 is ;-)

 Potentially more problematic, what happens when part (or all) of a
 repository is migrated into another? Can the domain be transferred to the
 'new' location? If not, can URL forwarding be set up on the old URLs?

 HTTP can provide a unique identifier for an object at a given point in time,
 but it isn't necessarily going to be possible to rely on it always resolving
 to the same object over its entire lifetime.

But that's like comparing apples to apple pickers. Forget
resolution; an HTTP URL is just as much a URI as a Handle or DOI
is. If CNRI's global registration and resolving proxy service
disappears, what becomes of all the existing handles? Yes, they could
possibly be considered persistent, but they would be worth little more than
unresolvable strings until a comparable resolution system is
reestablished.


 3) Including special characters in the URL string doesn't seem like a
 good idea.  While they are valid characters, it does take extra
 processing to encode/decode them from layer to layer.

 Totally agreed - having colons, etc. in the url is going to lead to problems
 in some circumstances.

Agreed, for DSpace identifiers, keep them simple for maximal  
portability into other naming systems.


 4) Assigning bitstreams persistent identifiers seems dangerous.  At the
 very least, version control and a history function are required by the
 application and PI system to determine if the PId is actually pointing
 to what was requested.  Also, how are multiple bitstreams handled when
 assigned to an item?  Does each bitstream get a PId?  How does a user
 look at all bitstreams associated together by the item when the PId
 references only a single bitstream?

 We had a fair amount of discussion about these issues during the
 architectural review last year - which were largely centered around
 extensions to the existing mechanism in order to reference specific (or
 simply the latest) version of a bitstream as relative to the item.

 Whether there is a need to assign an 'actual' PI to individual bitstreams or
 not is very much a policy decision of the repository. Assigning a PI to an
 individual bitstream does not mean that it happens in lieu of assigning one
 to the item itself - so if you want to look at other bitstreams associated
 to the same item, you should use the item PI (and if a user has only been
 given a PI for a specific bitstream, then they could potentially search for
 the item that refers to the bitstream identified by that PI).

 As for versioning, again it's a bit of a policy decision, but a PI could be
 assigned to a specific revision (and therefore a new revision would get a
 new PI). You could also have a 'special' PI that would always refer to the
 latest revision.

As long as any PI, or the Bitstream part of an Item PI, is
controllable and reassignable. As an example of what not to do: do
not take the current sequence id and tack it onto the Item id such
that a replacement bitstream (replaced because of an ingest error or other
policy) cannot have the appropriate identifier remapped to it. In
DSpace a sequence id can only be assigned to one bitstream; removing
that bitstream and adding another results in a new sequence id. (But
actually, this is mostly moot once versioning of Items is introduced.)

 As far as having a default PI system out of the box for Dspace, I would
 recommend using a local identifier schema which used the existing URLs.
 Include the Handle PI system in the release as a configurable option,
 but not turned on by default.  This would remove the fake handle being
 assigned to all objects and clean up the default URLs out of