Re: URI Comparisons: RFC 2616 vs. RDF

Nathan Thu, 20 Jan 2011 02:19:02 -0800

Alan Ruttenberg wrote:

On Wed, Jan 19, 2011 at 4:45 PM, Nathan <nat...@webr3.org> wrote:

David Wood wrote:

On Jan 19, 2011, at 10:59, Nathan wrote:

ps: as an illustration of how engrained URI normalization is, I've
capitalized the domain names in the to: and cc: fields, I do hope the mail
still comes through, and hope that you'll accept this email as being sent to
you. Hopefully we'll also find this mail in the archives shortly at
htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd
hope that any statements made using these URIs (asserted by man or machine)
would remain valid regardless of the (incorrect?-)casing.

Heh.  OK, I'll bite.  Domain names in email addressing are defined in IETF
RFC 2822 (and its predecessor RFC 822), which defers the interpretation to
RFC 1035 ("Domain names - implementation and specification").  RFC 1035
section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are
to be compared in a case-insensitive manner.

As far as I know, the W3C specs do not so refer to RFC 1035.

And I'll bite in the other direction, why not treat URIs as URIs? why go
against both the RDF Specification [1] and the URI specification when they
say /not/ to encode permitted US-ASCII characters (like ~ %7E)? why force
case-sensitive matching on the scheme and domain on URIs matching the
generic syntax when the specs say they must be compared case insensitively? and
so on and so forth.
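
To make the rules under discussion concrete, here is a minimal Python sketch of the syntax-based normalization described in RFC 3986 section 6.2.2: lowercase the scheme and host, decode percent-encoded unreserved characters, and uppercase the hex digits of any escapes that remain. The names `normalize`, `_fix_escape` and `_lower_host` are illustrative assumptions, not taken from any specification or library, and the sketch deliberately skips path-segment and scheme-specific normalization.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    """Decode an escape of an unreserved character, else uppercase its hex."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def _lower_host(netloc):
    """Lowercase host (and port), leaving any userinfo untouched."""
    userinfo, at, hostport = netloc.rpartition("@")
    return userinfo + at + hostport.lower()


def normalize(uri):
    """Hypothetical helper: case and percent-encoding normalization only."""
    parts = urlsplit(uri)
    return urlunsplit((
        parts.scheme.lower(),
        _lower_host(parts.netloc),
        _PCT.sub(_fix_escape, parts.path),
        _PCT.sub(_fix_escape, parts.query),
        _PCT.sub(_fix_escape, parts.fragment),
    ))


print(normalize("htTp://EXAMPLE.org/%7efoo"))   # http://example.org/~foo
print(normalize("http://example.org/a%2fb"))    # http://example.org/a%2Fb
```

Note that the reserved character in the second call stays encoded; only its hex digits change case.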


[AR]
Which specs?


The various URI/IRI specs and previous revisions of them.

http://www.w3.org/TR/REC-xml-names/#NSNameComparison

"URI references identifying namespaces

..

In a namespace declaration, the URI reference is

..

The URI references below are all different for the purposes of identifying
namespaces

..

The URI references below are also all different for the purposes of
identifying namespaces

..

So here is another spec that *explicitly* disagrees with the idea that URI
normalization should be a built-in processing.
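
The comparison rule being quoted amounts to plain string equality on the namespace name. A small Python illustration using the standard library's ElementTree, which keeps the namespace name verbatim in the expanded tag; the two namespace URIs here are made-up values chosen for this sketch, not the examples from the specification.

```python
import xml.etree.ElementTree as ET

# Two documents whose namespace names differ only in the case of the host.
lower = ET.fromstring('<p:item xmlns:p="http://example.org/ns"/>')
upper = ET.fromstring('<p:item xmlns:p="http://EXAMPLE.org/ns"/>')

print(lower.tag)               # {http://example.org/ns}item
print(upper.tag)               # {http://EXAMPLE.org/ns}item
print(lower.tag == upper.tag)  # False: no normalization is applied
```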

As far as I can see, that's only for a URI reference used within a namespace, and does not govern usage or normalization when you join the URI reference up with the local name to make the full URI.

Out of interest, where is that process defined? I was looking for it the other day - for instance in the quoted specification we have the example:

<edi:price xmlns:edi='http://ecommerce.example.org/schema' units='Euro'>32.18</edi:price>

Where's the bit of the XML specification which says you join them up by concatenating 'http://ecommerce.example.org/schema' with # (?assumed?) and 'Euro' to get 'http://ecommerce.example.org/schema#Euro'?

And finally, this is why I specifically asked if the non-normalization of RDF URI References had XML Namespace heritage, which had then filtered down through OWL, SPARQL and RIF.

[AR] More to document, please: Which data is being junked and scrapped?

will document, but essentially every statement made using a non-normalized URI when other statements are also being made about the "same" resource using normalized URIs - the two most common cases for this will be when people are using "CMS" systems and enter their domain name as uppercase in some admin, only to have that filter through to URIs in serialized RDF/RDFa, and where bugs in software have led to inconsistent URIs over time (for instance where % encoding has been fixed, or a :80 has been removed from a URI).
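
As a rough illustration of the ":80" case, the Python sketch below strips a default port and shows how two statements about the "same" resource end up under different subjects when URIs are compared as plain strings. The helper name `strip_default_port`, the `DEFAULT_PORTS` table and the sample statements are all assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports for the schemes this sketch cares about (an assumption).
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_default_port(uri):
    """Hypothetical helper: drop ':80' / ':443' when it is the scheme default."""
    parts = urlsplit(uri)
    netloc = parts.netloc
    default = DEFAULT_PORTS.get(parts.scheme)
    if default is not None:
        suffix = ":%d" % default
        if netloc.endswith(suffix):
            netloc = netloc[: -len(suffix)]
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


# Made-up statements: the same document, published before and after a fix.
statements = [
    ("http://example.org:80/doc", "dc:title", "A title"),
    ("http://example.org/doc", "dc:creator", "Somebody"),
]

raw_subjects = {s for s, _, _ in statements}
normalized_subjects = {strip_default_port(s) for s, _, _ in statements}

print(len(raw_subjects))         # 2: string comparison sees two resources
print(len(normalized_subjects))  # 1: after normalization they coincide
```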

[AR] Hmm. Are you suggesting that the behavior of libraries and clients
should have precedence over specification? My view is that one first looks
to specifications, and then only if specifications are poor or do not speak
to the issue do we look at existing behavior.

Yes I am, that specification should standardize the behaviour of libraries and clients - the level of normalization in URIs published, consumed or used by these tools is often determined by non sem web stack components, and the sem web components are blocked from normalizing these should-not-be-differing-URIs by the sem web specifications.

[AR] I think there are many ways to lose in this scenario. For instance, if
the server redirects then the base is the last in the chain of redirects.
http://tools.ietf.org/html/rfc3986#page-29, 5.1.3. Base URI from the
Retrieval URI. My conclusion - don't engineer this way.
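
The redirect pitfall can be sketched in a few lines of Python. Following RFC 3986 section 5.1.3, the base for relative references is the last URI in the redirect chain, not the one originally requested. The function name `resolve` and the redirect chain below are hypothetical, chosen only to illustrate the effect.

```python
from urllib.parse import urljoin


def resolve(reference, redirect_chain):
    """Resolve a relative reference against the final retrieval URI."""
    base = redirect_chain[-1]
    return urljoin(base, reference)


# Hypothetical chain: the requested URI, then where the server sent us.
chain = [
    "http://example.org/doc",
    "http://www.example.org/data/doc.html",
]

print(resolve("#me", chain))    # http://www.example.org/data/doc.html#me
print(resolve("other", chain))  # http://www.example.org/data/other
```

A publisher who expected `#me` to resolve against the requested URI would find it resolved against the redirect target instead.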

That would be my conclusion too, but as RDF(a) moves into the realms of the CMS systems and out of the hands of the sem web community, it will be increasingly engineered this way, it's a very common pattern when working with (X)HTML (allows people to test locally or on dev servers without changing the content).

Further, essentially all RDFa ever encountered by a browser has the casing
of all URIs in href and src, and of all those which are resolved, automatically
normalized - so even if you set the base to <htTp://EXAMPLE.org/> or use it
in a URI, browser tools, extensions, and js based libraries will only ever
see the normalized URIs (and thus be incompatible with the rest of the RDF
world).

[AR] Again, I think things are worse than possible to repair, if you take
the position that you need to make it work for deployed systems. As an
example I tried the following. On my mac I created the file
/Users/alanr/Desktop/foo.html. The contents
were: <script>alert(document.location);</script>. From the command line I
tried:

open file:///Users/alanr/Desktop/foo.html  ->
alert file:///Users/alanr/Desktop/foo.html
open file:///Users/alanr/Desktop/Foo.html ->
alert file:///Users/alanr/Desktop/Foo.html
open FILE:///Users/alanr/Desktop/foo.html ->
alert file:///Users/alanr/Desktop/foo.html
open file:///Users/alanr/Desktop/%66oo.html
-> alert file:///Users/alanr/Desktop/foo.html
open file:///Users/alanr/Desktop/%46oo.html
-> alert file:///Users/alanr/Desktop/Foo.html

Indeed, if you tried that in chrome/IE you'd get full normalization, in Opera you'd get something similar to above, and in firefox different again (unsure if it also differs per OS). See:


  http://webr3.org/urinorm/html

I did some testing this way yesterday and flagged it up with Adam Barth, who's handling the URI/URL canonicalization and normalization for the HTML / webapps specifications [1].


The results /really/ affect RDFa Processing, see:

  http://webr3.org/urinorm/2

And as a member of the RDFa WG, focussed mainly on the API specifications, this is a real problem that needs solved - @href and @src are governed by HTML, access to the URIs within is via the DOM specifications, and the common implementations all provide normalization as standard, and as far as I can tell Adam Barth will be aligning expected normalization and canonicalization in the specifications. RDFa sitting at the intersection of this is very affected.

Note, those are not my only reasons for flagging up this normalization issue - (issue imo, time will tell if it is considered, or made, an issue by the RDF WG).

[1] http://lists.w3.org/Archives/Public/public-html-comments/2011Jan/0004.html

[AR] In this case your conjecture is shown to be partially true. The URI
scheme is made case insensitive. However the pathname is not normalized,
which results in mistakes in the intended base; in the case of file: URLs this
seems to depend on the case sensitivity of the file system on which the URL
is resolved, something a generic processor could not possibly know.

As above, you'll find increasing steps of normalization by different vendors, chrome and IE for example do "full" normalization.

Finally, I'll ask again, if anybody has any use case which benefits from
<htTp://EXAMPLE.org/%7efoo> and <http://example.org/~foo> being classed as
different RDF URIs, I'd love to hear it.

[AR] Backwards compatibility of OWL, RIF, and SPARQL.

Surely that's only an issue if somebody somewhere has data which would be negatively impacted by URIs being normalized - is there such a case? It may also be wise to consider whether people would benefit from URI normalization in regards to RDF, OWL, RIF and SPARQL - and if so surely there's a case for raising BUGs and fixing this throughout the sem web specifications.


Cheers,

Nathan
