Hi Dave,

Generally I agree. I'll address a few specific points inline, then summarize my intended goals at the end (which is the real substance of this mail).

Dave Reynolds wrote:
The URI spec (rfc3986[1]) does allow this usage. In particular Section 6
Normalization and Comparison says:

"""URI comparison is performed for some particular purpose. Protocols or implementations that compare URIs for different purposes will
   often be subject to differing design trade-offs in regards to how
   much effort should be spent in reducing aliased identifiers.  This
   section describes various methods that may be used to compare URIs,
   the trade-offs between them, and the types of applications that might
   use them."""

and

"""We use the terms "different" and
   "equivalent" to describe the possible outcomes of such comparisons,
   but there are many application-dependent versions of equivalence."""

While RDF predates this spec it seems to me that the RDF usage remains
consistent with it. The purpose of comparison in RDF is different from
that of cache retrieval of web pages or message delivery of email.

Indeed, though I also read:

   For all URIs, the hexadecimal digits within a percent-encoding
   triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
   should be normalized to use uppercase letters for the digits A-F.

   When a URI uses components of the generic syntax, the component
   syntax equivalence rules always apply; namely, that the scheme and
   host are case-insensitive and therefore should be normalized to
   lowercase...
   - http://tools.ietf.org/html/rfc3986#section-6.2.2.1

And I took the "For all" and "always" to literally mean "for all" and "always".

I'm unsure where this leaves things, or which takes precedence.
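
To make the two rules quoted above concrete, here is a minimal sketch of case normalization as I read section 6.2.2.1. The function name case_normalize and the regular expressions are my own illustration (the split is loosely based on the generic-syntax regex in RFC 3986 Appendix B), not something taken from any spec or library:

```python
import re

# Split off an optional scheme and an optional authority; everything else
# (path, query, fragment) is kept as an opaque remainder.
_URI = re.compile(r"^(?:([^:/?#]+):)?(?://([^/?#]*))?(.*)$", re.DOTALL)
_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def case_normalize(uri: str) -> str:
    """Lowercase scheme and host, uppercase hex digits in %XX triplets."""
    scheme, authority, rest = _URI.match(uri).groups()
    out = ""
    if scheme is not None:
        out += scheme.lower() + ":"
    if authority is not None:
        # authority = [userinfo "@"] host [":" port]; userinfo keeps its case.
        userinfo, at, hostport = authority.rpartition("@")
        out += "//" + userinfo + at + hostport.lower()
    out += rest
    # Done last so that triplets inside the host end up uppercase too.
    return _PCT_TRIPLET.sub(lambda m: m.group(0).upper(), out)


print(case_normalize("hTTp://Example.COM/Foo%3a"))
```

Note that the path, query and fragment keep their case; only the scheme, the host and the hex digits of percent-encoded triplets are touched.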

This quote also makes clear that there is no single definitive
normalization. There are different levels of normalization possible
depending on your needs.

agree

So I claim that in terms of formal published specifications:
(1) RDF, OWL and RIF do not require any normalization of URIs (beyond
the character encoding level) and compare URIs by simple string
comparison.

One potential issue on the % encoding, clarified further down.

(2) This usage is *not* precluded by the URI specs, at least by 3986
which sets the current framework for the application of scheme-specific
specs.

Not 100% sure, but I'm tempted to agree with you; it would make sense not to preclude it.

As we've already mentioned :) there are no specs for linked data so we
move onto more subjective grounds.

Would be nice to get some specs at some point...

The linked data convention is that dereferencing some URI U in your RDF
document should return information about U, including further onward
links. So if data set A spells a URI hTTp://example.com/foo but the data
you get from dereferencing that URI talks only about
http://example.com/foo then someone has a problem somewhere. The
question is who, where and how to fix it.

agree, good way of putting it.
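
A toy illustration of that problem, under simple string comparison. The triples, the FOAF properties and the helper name describe are all made up for the example; triples are modelled as plain Python tuples rather than with any RDF library:

```python
# Hypothetical data set A, which spells the scheme "hTTp".
dataset_a = {
    ("http://example.org/alice",
     "http://xmlns.com/foaf/0.1/knows",
     "hTTp://example.com/foo"),
}

# Hypothetical data returned by dereferencing that URI; it only talks
# about the lowercase spelling.
dereferenced = {
    ("http://example.com/foo",
     "http://xmlns.com/foaf/0.1/name",
     "Foo"),
}


def describe(node, triples):
    """Return the (predicate, object) pairs whose subject equals node,
    using plain character-by-character string comparison."""
    return {(p, o) for (s, p, o) in triples if s == node}


for _, _, target in dataset_a:
    # Finds nothing: "hTTp://..." and "http://..." are different strings.
    print(target, "->", describe(target, dereferenced))
```

A consumer following the link from data set A gets a document back, but finds no statements about the URI it actually followed.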

against both the RDF Specification [1] and the URI specification when they say /not/ to encode permitted US-ASCII characters (like ~ %7E)?

Where did that example come from?

   The encoding consists of... %-escaping octets that do not correspond
   to permitted US-ASCII characters.
   - http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref

   For consistency, percent-encoded octets in the ranges of ALPHA
   (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E),
   underscore (%5F), or tilde (%7E) should not be created by URI
   producers and, when found in a URI, should be decoded to their
   corresponding unreserved characters by URI normalizers.
   - http://tools.ietf.org/html/rfc3986#section-2.3

I read those quotes as saying do not encode permitted US-ASCII characters in RDF URI References.
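
As a rough sketch of what the section 2.3 quote asks of a normalizer: decode only those triplets that stand for unreserved characters, and leave every other triplet alone. The name decode_unreserved is mine, purely for illustration:

```python
import re

# Unreserved characters per RFC 3986 section 2.3:
# ALPHA / DIGIT / "-" / "." / "_" / "~"
_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode %XX triplets that encode unreserved characters; keep the rest."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in _UNRESERVED else match.group(0)

    # A single left-to-right pass, so "%257E" is not double-decoded to "~".
    return _PCT_TRIPLET.sub(replace, uri)


print(decode_unreserved("http://example.com/%7Euser/%41%2Fb%20c"))
```

Here %7E and %41 become "~" and "A", while %2F and %20 stay encoded because "/" is reserved and space is not permitted unencoded.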

At what point have we suggested doing that?

As above

why force case-sensitive matching on the scheme and domain on URIs matching the generic syntax when the specs say must be compared case insensitively?

No, the specs do not say that, see above.

See "for all" and "always" quote earlier on.

So use normalized URIs in the first place.
...
RDF/OWL/RIF aren't designed the way they are because someone thought it
would be a good idea to allow such things to be used side by side or
because they *want* people to use denormalized URIs.
...
The point is that there is no single, simple, universal (i.e. across all
schemes) normalization algorithm that could be used.
The current approach gives stable, well-defined behaviour which doesn't
change as people invent new URI schemes. The RDF serializations give you
enough control to enable you to be certain about what URI you are
talking about. Job done.

Okay, I agree, and I'm really not looking to create a lot of work here. The general gist of what I'm hoping for is along these lines:

RDF Publishers MUST perform Case Normalization and Percent-Encoding Normalization on all URIs prior to publishing. When using relative URIs, publishers SHOULD include a well-defined base using a serialization-specific mechanism. Publishers are advised to perform additional normalization steps as specified by the URI specification (RFC 3986) where possible.

RDF Consumers MAY normalize URIs they encounter and SHOULD perform Case Normalization and Percent-Encoding Normalization.

Two RDF URIs are equal if and only if they compare as equal, character by character, as Unicode strings.

For many reasons it would be good to solve this at the publishing phase, allow normalization at the consuming phase (it can't be precluded, as intermediary components may normalize), and keep simple case-sensitive string comparison throughout the stack and specs (so implementations remain simple and fast).

Does anybody find the above disagreeable?

Best, and cheers for the reply Dave,

Nathan
