Alan Ruttenberg wrote:
On Wed, Jan 19, 2011 at 4:45 PM, Nathan <nat...@webr3.org> wrote:
David Wood wrote:
On Jan 19, 2011, at 10:59, Nathan wrote:

ps: as an illustration of how engrained URI normalization is, I've
capitalized the domain names in the to: and cc: fields, I do hope the mail
still comes through, and hope that you'll accept this email as being sent to
you. Hopefully we'll also find this mail in the archives shortly at
htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd
hope that any statements made using these URIs (asserted by man or machine)
would remain valid regardless of the (incorrect?-)casing.

Heh.  OK, I'll bite.  Domain names in email addressing are defined in IETF
RFC 2822 (and its predecessor RFC 822), which defers the interpretation to
RFC 1035 ("Domain names - implementation and specification").  RFC 1035
section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are
to be compared in a case-insensitive manner.

As far as I know, the W3C specs do not so refer to RFC 1035.

And I'll bite in the other direction, why not treat URIs as URIs? why go
against both the RDF Specification [1] and the URI specification when they
say /not/ to encode permitted US-ASCII characters (like ~ as %7E)? why force
case-sensitive matching on the scheme and domain on URIs matching the
generic syntax when the specs say must be compared case insensitively? and
so on and so forth.

[AR]
Which specs?

The various URI/IRI specs and previous revisions thereof.

http://www.w3.org/TR/REC-xml-names/#NSNameComparison

"URI references identifying namespaces
..
In a namespace declaration, the URI reference is
..
The URI references below are all different for the purposes of identifying
namespaces
..
The URI references below are also all different for the purposes of
identifying namespaces
..
So here is another spec that *explicitly* disagrees with the idea that URI
normalization should be a built-in processing.

As far as I can see, that's only for a URI reference used within a namespace, and does not govern usage or normalization when you join the URI reference up with the local name to make the full URI.
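
To make the namespace case concrete, here is a minimal Python sketch of character-for-character namespace comparison, using the standard library's xml.etree.ElementTree. The three example.org namespace URIs and the element name x are made up for illustration, and ElementTree is only one parser that happens to behave this way.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs differing only in case or in %-encoding.
NAMESPACES = [
    "http://www.example.org/~wine",
    "http://www.Example.org/~wine",
    "http://www.example.org/%7Ewine",
]

def expanded_name(namespace_uri):
    """Parse a one-element document and return its '{namespace}local' tag."""
    doc = '<a:x xmlns:a="%s"/>' % namespace_uri
    return ET.fromstring(doc).tag

names = [expanded_name(ns) for ns in NAMESPACES]
for name in names:
    print(name)

# The parser copies the namespace string through untouched, so the three
# elements end up with three distinct expanded names.
print(len(set(names)))
```

Nothing here says what happens once a namespace name and a local name are combined into a full URI; it only shows that the namespace strings themselves are not normalized before comparison.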

Out of interest, where is that process defined? I was looking for it the other day - for instance in the quoted specification we have the example:

<edi:price xmlns:edi='http://ecommerce.example.org/schema' units='Euro'>32.18</edi:price>

Where's the bit of the XML specification which says you join them up by concatenating 'http://ecommerce.example.org/schema' with #(?assumed?) and 'Euro' to get 'http://ecommerce.example.org/schema#Euro'?

And finally, this is why I specifically asked if the non-normalization of RDF URI References had XML Namespace heritage, which had then filtered down through OWL, SPARQL and RIF.

[AR] More to document, please: Which data is being junked and scrapped?

Will document, but essentially every statement made using a non-normalized URI when other statements are also being made about the "same" resource using normalized URIs. The two most common cases for this will be (1) people using CMSs who enter their domain name in uppercase in some admin screen, only to have that filter through to URIs in serialized RDF/RDFa, and (2) bugs in software that have led to inconsistent URIs over time (for instance where % encoding has been fixed, or a :80 has been removed from a URI).
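
For illustration, the kind of normalization that would collapse those cases (an uppercase host, %-encoding differences, a stray :80) can be sketched in Python roughly as follows. This is a rough sketch of the case, percent-encoding and scheme-based steps described in RFC 3986 section 6.2, not a complete implementation: the names normalize, PCT and DEFAULT_PORTS are my own, only http and https default ports are known to it, and dot-segment removal is left out.

```python
import re
from urllib.parse import urlsplit, urlunsplit

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
DEFAULT_PORTS = {"http": 80, "https": 443}  # assumption: only these two schemes
PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def _normalize_escapes(text):
    """Decode %XX for unreserved characters, upper-case the hex of the rest."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return PCT.sub(fix, text)

def normalize(uri):
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""          # .hostname is returned lower-cased
    if ":" in host:                      # IPv6 literal: restore the brackets
        host = "[" + host + "]"
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += ":" + str(parts.port)  # keep only non-default ports
    userinfo = parts.netloc.rpartition("@")[0]
    if userinfo:                         # userinfo is kept exactly as given
        netloc = userinfo + "@" + netloc
    path = _normalize_escapes(parts.path)
    if netloc and not path:
        path = "/"                       # treat an empty path as "/"
    return urlunsplit((scheme, netloc, path,
                       _normalize_escapes(parts.query),
                       _normalize_escapes(parts.fragment)))

variants = [
    "htTp://EXAMPLE.org/%7efoo",
    "http://example.org:80/~foo",
    "http://example.org/%7Efoo",
]
print({normalize(v) for v in variants})  # a single normalized form
```

Compared as plain strings, the three variants are three different identifiers; after normalize() they are one.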

[AR] Hmm. Are you suggesting that the behavior of libraries and clients
should have precedence over specification? My view is that one first looks
to specifications, and then only if specifications are poor or do not speak
to the issue do we look at existing behavior.

Yes I am, in the sense that specification should standardize the behaviour of libraries and clients. The level of normalization in URIs published, consumed or used by these tools is often determined by non sem web stack components, and the sem web components are blocked from normalizing these should-not-be-differing URIs by the sem web specifications.

[AR] I think there are many ways to lose in this scenario. For instance, if
the server redirects then the base is the last in the chain of redirects.
http://tools.ietf.org/html/rfc3986#page-29, 5.1.3. Base URI from the
Retrieval URI. My conclusion - don't engineer this way.
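
A small Python sketch of that failure mode, using urllib.parse.urljoin for reference resolution. Both URLs and the relative reference are hypothetical; the point is only that the same reference resolves to different URIs depending on which URI in the redirect chain is taken as the base.

```python
from urllib.parse import urljoin

# Hypothetical: the URI the client asked for, and the URI it was
# finally redirected to.
requested = "http://example.org/people"
retrieved = "http://www.example.org/cms/node/42"

reference = "#nathan"  # relative reference found in the document

print(urljoin(requested, reference))  # http://example.org/people#nathan
print(urljoin(retrieved, reference))  # http://www.example.org/cms/node/42#nathan
```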

That would be my conclusion too, but as RDF(a) moves into the realm of CMSs and out of the hands of the sem web community, it will increasingly be engineered this way. It's a very common pattern when working with (X)HTML, since it allows people to test locally or on dev servers without changing the content.

Further, essentially all RDFa ever encountered by a browser has the casing
of all URIs in href and src, and of all those which are resolved, automatically
normalized - so even if you set the base to <htTp://EXAMPLE.org/> or use it
in a URI, browser tools, extensions, and js based libraries will only ever
see the normalized URIs (and thus be incompatible with the rest of the RDF
world).

[AR] Again, I think things are worse than possible to repair, if you take
the position that you need to make it work for deployed systems. As an
example I tried the following. On my mac I created the file
/Users/alanr/Desktop/foo.html. The contents
were: <script>alert(document.location);</script>. From the command line I
tried:

open file:///Users/alanr/Desktop/foo.html  ->
alert file:///Users/alanr/Desktop/foo.html
open file:///Users/alanr/Desktop/Foo.html ->
alert file:///Users/alanr/Desktop/Foo.html
open FILE:///Users/alanr/Desktop/foo.html ->
alert file:///Users/alanr/Desktop/foo.html
open file:///Users/alanr/Desktop/%66oo.html
-> alert file:///Users/alanr/Desktop/foo.html
open file:///Users/alanr/Desktop/%46oo.html
-> alert file:///Users/alanr/Desktop/Foo.html
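
The last two lines come down to percent-decoding: %66 is "f" and %46 is "F", so the decoded paths differ only in case. A quick Python check with urllib.parse.unquote (an illustration only, not what the browser runs internally):

```python
from urllib.parse import unquote

paths = [
    "/Users/alanr/Desktop/%66oo.html",
    "/Users/alanr/Desktop/%46oo.html",
]
decoded = [unquote(p) for p in paths]

for raw, plain in zip(paths, decoded):
    print(raw, "->", plain)

# The decoded paths are different strings that match once case is ignored.
print(decoded[0] == decoded[1])                   # False
print(decoded[0].lower() == decoded[1].lower())   # True
```

Whether those two paths name the same file then depends on the case sensitivity of the file system.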

Indeed, if you tried that in Chrome/IE you'd get full normalization, in Opera you'd get something similar to the above, and in Firefox something different again (unsure if it also differs per OS). See:

  http://webr3.org/urinorm/html

I did some testing this way yesterday and flagged it up with Adam Barth, who's handling the URI/URL canonicalization and normalization for the HTML / webapps specifications [1].

The results /really/ affect RDFa Processing, see:

  http://webr3.org/urinorm/2

And as a member of the RDFa WG, focussed mainly on the API specifications, this is a real problem that needs solving: @href and @src are governed by HTML, access to the URIs within is via the DOM specifications, and the common implementations all provide normalization as standard. As far as I can tell, Adam Barth will be aligning expected normalization and canonicalization in the specifications. RDFa, sitting at the intersection of this, is very affected.

Note, those are not my only reasons for flagging up this normalization issue - (issue imo, time will tell if it is considered, or made, an issue by the RDF WG).

[1] http://lists.w3.org/Archives/Public/public-html-comments/2011Jan/0004.html

[AR] In this case your conjecture is shown to be partially true. The URI
scheme is made case insensitive. However the pathname is not normalized,
which results in mistakes in the intended base; in the case of file: URLs it
seems to depend on the case sensitivity of the file system on which the URL
is resolved, something a generic processor could not possibly know.

As above, you'll find increasing steps of normalization by different vendors; Chrome and IE, for example, do "full" normalization.


Finally, I'll ask again, if anybody has any use case which benefits from
<htTp://EXAMPLE.org/%7efoo> and <http://example.org/~foo> being classed as
different RDF URIs, I'd love to hear it.

[AR] Backwards compatibility of OWL, RIF, and SPARQL.

Surely that's only an issue if somebody somewhere has data which would be negatively impacted by URIs being normalized - is there such a case? It may also be wise to consider whether people would benefit from URI normalization with regard to RDF, OWL, RIF and SPARQL - and if so, surely there's a case for raising bugs and fixing this throughout the sem web specifications.

Cheers,

Nathan
