Re: URI canonicalization

Antone Roundy Tue, 01 Feb 2005 14:49:52 -0800

On Monday, January 31, 2005, at 10:57 PM, Roy T. Fielding wrote:

There is no reason to require any particular comparison algorithm.
One application is going to compare them the same way every time.
Two different applications may reach different conclusions about
two equivalent identifiers, but nobody cares because AT WORST the
result is a bit of inefficient use of storage.

The guidance, if any, should simply state that identifier
constructs must be unique.  It is not our responsibility to
prevent people from assigning the same (equivalent) identifiers
to two different resources, nor do I care how many errors
occur when they violate such a basic requirement.

While we certainly shouldn't be in the business of saying "IDs MUST be compared using the C function 'strcmp'", I think we need to be specific about what it means for an ID to be unique. Given that we're using URIs as IDs, and two URIs can be functionally identical without being characterwise identical, we need to explain what we mean by "unique". If we don't want to mandate a particular algorithm, then we need to get the point across some other way. Perhaps (the first sentence is the existing first sentence):

   Instances of Identity constructs can be compared to determine whether
   an entry or feed is the same as one seen before.  The values of two
   Identity constructs are considered to be the same if a case-sensitive
   character-by-character comparison would recognize them as identical.

This language doesn't mandate actually performing a "case-sensitive character-by-character comparison", but it should be clear from this language what it means to be the same. If we want to get even further from talking about algorithms, we could go with something like this, but it begins to sound a little strained:

   Instances of Identity constructs can be compared to determine whether
   an entry or feed is the same as one seen before.  The values of two
   Identity constructs are considered to be the same if they contain
   exactly the same code points in exactly the same order.
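
To make the "same code points in the same order" wording concrete, here is a rough Python sketch. It is an illustration only, not proposed spec text: the function name same_identity and the example.org URIs are made up for the example, and nothing here requires an implementation to compare this way.

```python
def same_identity(a: str, b: str) -> bool:
    """Return True if both values hold exactly the same code points
    in exactly the same order (hypothetical helper, for illustration)."""
    # Python's str equality already compares code point sequences;
    # the explicit lists just spell out what "same" means here.
    return [ord(c) for c in a] == [ord(c) for c in b]


# Characterwise identical: the same.
identical = same_identity("http://example.org/entry/1",
                          "http://example.org/entry/1")

# Functionally equivalent as URIs (scheme and host are case-insensitive),
# but not the same under this definition.
case_variant = same_identity("http://example.org/entry/1",
                             "HTTP://EXAMPLE.ORG/entry/1")

# The default port spelled out: also equivalent as a URI, also not the same.
port_variant = same_identity("http://example.org/entry/1",
                             "http://example.org:80/entry/1")
```

Under this reading, only the first pair counts as the same identifier; the other two are treated as different IDs even though a URI-aware comparison might consider them equivalent.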

If we want to get even more precise, and don't mind getting more wordy:

   Instances of Identity constructs can be compared to determine whether
   an entry or feed is the same as one seen before.  The values of two
   Identity constructs are considered to be the same if, after XML
   deserialization, but before any other processing such as decoding of
   percent-encoded characters, they contain exactly the same code points
   in exactly the same order.
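
As a sketch of what "after XML deserialization, but before any other processing" would mean in practice, the Python below parses three fragments and compares the resulting values. The element name "id", the helper name deserialized_value, and the example.org URIs are assumptions made for the illustration; only the standard library XML parser is used.

```python
import xml.etree.ElementTree as ET


def deserialized_value(xml_fragment: str) -> str:
    """Return the text of a single element after XML deserialization.

    Hypothetical helper: the parser resolves character references such
    as &#x7E;, but nothing here decodes percent-encoded characters.
    """
    return ET.fromstring(xml_fragment).text or ""


plain = deserialized_value("<id>http://example.org/~user/1</id>")
char_ref = deserialized_value("<id>http://example.org/&#x7E;user/1</id>")
percent = deserialized_value("<id>http://example.org/%7Euser/1</id>")

# The XML character reference disappears during deserialization,
# so these two values contain the same code points.
same_after_xml = (plain == char_ref)

# Percent-encoding is left untouched, so "%7E" and "~" stay different
# even though they identify the same thing once the URI is decoded.
same_with_percent = (plain == percent)
```

So two serializations that differ only in XML-level escaping would compare as the same, while a percent-encoded variant would not.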
