Re: [Dbpedia-discussion] URIs with "<" in them confusing Virtuoso and Jena

Ted Thibodeau Jr Mon, 25 Aug 2008 18:40:06 -0700

Hi, Chris --

* Chris Schumacher [2008/08/25 03:44 PM -0700] wrote:
> This is similar to the recent ampersand issue.
>
> An old URI RFC [1] (section 2.4.3) states that angle brackets are
> illegal in URIs, but the current spec [2] and RFC 3986 [3] seem to
> allow them(!).  The dbpedia3.1 externallinks_en.nt file has several
> URIs with "<" which is leading to confusion for both Virtuoso and
> Jena.
>
> For example, at <http://dbpedia.org/snorql/> the following query
> will confuse virtuoso:
>
>    SELECT * WHERE {
>    <http://www.sample.com<dogs> ?p ?o
>    }
>
> remain in light,
> cws
>
> [1] <http://www.faqs.org/rfcs/rfc2396.html>
> [2] <http://www.w3.org/Addressing/URL/uri-spec.html>
> [3] <http://www.ietf.org/rfc/rfc3986.txt>

Well...

First thing...

I've just dug into the file in question, and there are 8 URIs
causing this sort of trouble, all in the ?o position, each in
a single triple.

   <http://www.youtube.com/watch?v=vgKWDwRw_DE<!--> .

   <http://www.nytimes.com/2008/03/14/business/media/14adco.html?_r=1&oref=slogin<br> .

   <http://links.jstor.org/sici?sici=0891-3609%28192712%2F192801%2923%3A117%3C19%3AASOTEV%3E2.0.CO%3B2-4&size=LARGE&origin=JSTOR-enlargePage<!--> .

   <http://links.jstor.org/sici?sici=0891-3609%28192912%2925%3A130%3C24%3ATSCOCS%3E2.0.CO%3B2-T<!--> .

   <http://www.youtube.com/watch?v=uVcje0t3-nE<YouTube> .

   <http://www.royalsportal.de/forum/index.php?showtopic=22787&hl=Hesse<!--> .

   <http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=text&uid=6345791&dopt=Abstract<!--> .

   <http://www.up-rs.si/up-rs/uprs.nsf/dokumentiweb/46F76AD2AEA5E1A6C125746A004692FA?OpenDocument<!--> .

It looks to me like, although these URIs are all valid, they won't
have the effect one might expect, when dereferenced.  The snippets
from the enclosed left-angle-bracket to the end of the URI are (to
me) clearly erroneous -- 6 of them are the comment-starting "<!--";
one is a "<br", and the last is the literal text, "<YouTube".

These are errors, and should be tidied up in the sources.
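
For anyone wanting to find such triples in a dump, here is a rough
sketch, assuming one triple per line with IRIs for subject and
predicate.  The function name `suspicious_object_uri` and the
subject and predicate in the usage note are made up for illustration;
they are not taken from the DBpedia file.

```python
import re

# "<subject> <predicate> object ." with the object captured raw, so that
# a stray "<" inside it does not stop the match.
TRIPLE = re.compile(r'^\s*(<[^<>\s]*>)\s+(<[^<>\s]*>)\s+(.*?)\s*\.\s*$')


def suspicious_object_uri(line):
    """Return the object IRI if it contains a stray angle bracket, else None."""
    m = TRIPLE.match(line)
    if not m:
        return None  # not a triple in the simple shape assumed here
    obj = m.group(3)
    if not (obj.startswith("<") and obj.endswith(">")):
        return None  # literal or blank node, not an IRI
    inner = obj[1:-1]  # drop the delimiting angle brackets
    return inner if ("<" in inner or ">" in inner) else None
```

Called on a line such as
`<http://example.org/s> <http://example.org/p> <http://www.youtube.com/watch?v=uVcje0t3-nE<YouTube> .`
it hands back the object with the trailing "<YouTube" still attached,
which is the part that wants cleaning up in the source.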

That said -- while this has some surface resemblance to the ampersand
issue, in that resolution is found by reading RFCs, including the RFC
3986 you cite --

   [1] <http://www.rfc.net/rfc3986.html>.

-- the sample URI in your query is broken, and I would expect it to
get an error back (so there does seem to be some error handling to
be added to some tools).

I was writing a followup to Richard's assertion that "Ampersands
are allowed in URIs", but as usual, a proper followup is rather
full of details and detours, so I wasn't done yet.

But clearly, it's needed now, so I'll include it here in its
current state, modified somewhat.

In the current case, your sample URI --

   <http://www.sample.com<dogs>

 -- is invalid, not simply because it contains the left angle bracket
"<" -- which *is* permitted in a general sense -- but because of
*where* that character is found.

This is the key piece of the URI syntax which your sample breaks --

   [2] <http://www.rfc.net/rfc3986.html#s3.2.>

   The authority component is preceded by a double slash ("//")
   and is terminated by the next slash ("/"), question mark ("?"),
   or number sign ("#") character, or by the end of the URI.

      authority   = [ userinfo "@" ] host [ ":" port ]

As there is no "@" and no ":", host is the sub-segment that matters.
The following further analysis comes from Appendix A, the Collected
ABNF for URI --

   [3] <http://www.rfc.net/rfc3986.html#sA.>

   host          = IP-literal / IPv4address / reg-name

I think we can agree that the host value is neither an IP-literal
nor an IPv4address; so it must be reg-name.

   reg-name      = *( unreserved / pct-encoded / sub-delims )

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

   pct-encoded   = "%" HEXDIG HEXDIG

   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

As the left-angle-bracket is not specifically included in either
the unreserved or sub-delims character sets, it *must* be percent-
encoded to be included in this segment.
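
The reg-name rule is small enough to check mechanically.  A minimal
sketch, assuming the authority carries no userinfo or port (as in the
sample URI), so the whole authority can be treated as the host; the
helper names `authority_of` and `is_reg_name` are my own, not part of
any library.

```python
import re

# reg-name = *( unreserved / pct-encoded / sub-delims ), as quoted above.
REG_NAME = re.compile(
    r"(?:[A-Za-z0-9._~-]"      # unreserved
    r"|%[0-9A-Fa-f]{2}"        # pct-encoded
    r"|[!$&'()*+,;=])*"        # sub-delims
)


def authority_of(uri):
    """Return the text between "//" and the next "/", "?", "#", or the end."""
    _, sep, rest = uri.partition("//")
    if not sep:
        return None  # no authority component at all
    return re.split(r"[/?#]", rest, maxsplit=1)[0]


def is_reg_name(host):
    """True if every character of host is allowed by the reg-name rule."""
    return REG_NAME.fullmatch(host) is not None


host = authority_of("http://www.sample.com<dogs")
print(host, is_reg_name(host))                 # raw "<" is rejected
print(is_reg_name("www.sample.com%3Cdogs"))    # percent-encoded form passes
```

This checks only the character set of the rule; it says nothing about
whether the resulting name would resolve.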

Interestingly, Appendix C of RFC 3986 includes most of what was
in RFC 2396 Sec 2.4.3 -- but there is nothing in the current RFC
which makes angle-brackets or double-quotes illegal in URIs, and
thus I have to question whether this section remains accurate --

   [4] <http://www.rfc.net/rfc3986.html#sC.>

   [...] In such cases, it is important to be able to delimit
   the URI from the rest of the text, and in particular from
   punctuation marks that might be mistaken for part of the URI.

   In practice, URIs are delimited in a variety of ways, but
   usually within double-quotes "http://example.com/", angle
   brackets <http://example.com/>, or just by using whitespace:

      http://example.com/

   These wrappers do not form part of the URI.

I'm left wondering whether the omission of angle-brackets from
the reserved list was intentional or accidental.

That said, to what I was already writing --

* Richard Cyganiak [2008/08/20 09:29 AM +0100] wrote:
> Ampersands are allowed in URIs, so the Yago URIs are perfectly
> fine according to all the specs. (We *might* still want to
> %-encode the ampersand in those URIs, but just for consistency
> with our other URIs, not because the specs require it. That's
> a separate question.)

Absolute statements can be dangerous.  On lists like these,
statements such as the above can become quoted authority,
even when incorrect ... as now.

Ampersands are allowed in *some* components of *some* URIs, and
those *do* include the Yago URIs, so far as I can tell.

   [5] <http://www.rfc.net/rfc3986.html#s2.2.>

   2.2. Reserved Characters

   URIs include components and subcomponents that are delimited
   by characters in the "reserved" set.  These characters are
   called "reserved" because they may (or may not) be defined as
   delimiters by the generic syntax, by each scheme-specific
   syntax, or by the implementation-specific syntax of a URI's
   dereferencing algorithm.  If data for a URI component would
   conflict with a reserved character's purpose as a delimiter,
   then the conflicting data must be percent-encoded before the
   URI is formed.

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

   The purpose of reserved characters is to provide a set of
   delimiting characters that are distinguishable from other
   data within a URI. URIs that differ in the replacement of
   a reserved character with its corresponding percent-encoded
   octet are not equivalent.  Percent-encoding a reserved
   character, or decoding a percent-encoded octet that
   corresponds to a reserved character, will change how the
   URI is interpreted by most applications.  Thus, characters
   in the reserved set are protected from normalization and
   are therefore safe to be used by scheme-specific and
   producer-specific algorithms for delimiting data
   subcomponents within a URI.

   A subset of the reserved characters (gen-delims) is used
   as delimiters of the generic URI components described in
   Section 3.  A component's ABNF syntax rule will not use
   the reserved or gen-delims rule names directly; instead,
   each syntax rule lists the characters allowed within that
   component (i.e., not delimiting it), and any of those
   characters that are also in the reserved set are "reserved"
   for use as subcomponent delimiters within the component.
   Only the most common subcomponents are defined by this
   specification; other subcomponents may be defined by a URI
   scheme's specification, or by the implementation-specific
   syntax of a URI's dereferencing algorithm, provided that
   such subcomponents are delimited by characters in the
   reserved set allowed within that component.

   URI producing applications should percent-encode data octets
   that correspond to characters in the reserved set unless
   these characters are specifically allowed by the URI scheme
   to represent data in that component.  If a reserved character
   is found in a URI component and no delimiting role is known
   for that character, then it must be interpreted as
   representing the data octet corresponding to that character's
   encoding in US-ASCII.

Note that the ampersand is included in the "sub-delims" portion
of the "reserved" set.  Note, too, that the reserved status of "&"
in HTTP URIs has *changed* as the URI syntax RFCs have evolved --
but these changes have not always been properly documented!
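
The point in Section 2.2 that a reserved character and its
percent-encoded form are not equivalent is easy to see in a query
string.  A small sketch using Python's standard `urllib.parse`; the
key names and values are invented for illustration.

```python
from urllib.parse import parse_qsl, quote

# With safe="" nothing is exempted, so reserved characters get encoded.
encoded_amp = quote("&", safe="")   # "%26"
encoded_lt = quote("<", safe="")    # "%3C"

# A literal "&" acts as a delimiter between two key=value pairs ...
literal = parse_qsl("a=1&b=2")      # [("a", "1"), ("b", "2")]

# ... while "%26" is data inside a single value.
encoded = parse_qsl("a=1%26b=2")    # [("a", "1&b=2")]

print(encoded_amp, encoded_lt)
print(literal)
print(encoded)
```

Two URIs differing only in "&" versus "%26" therefore carry different
data, which is why a normalizer must not rewrite one into the other.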

Appendix D of RFC 3986 [6] <http://www.rfc.net/rfc3986.html#sD.>
is supposed to show "Changes from RFC 2396" -- but it left out
the following, which is key here.

The current RFC shows --

   [7] <http://www.rfc.net/rfc3986.html#s3.4.>

   3.4. Query

   The query component contains non-hierarchical data that,
   along with data in the path component (Section 3.3), serves
   to identify a resource within the scope of the URI's scheme
   and naming authority (if any).  The query component is
   indicated by the first question mark ("?") character and
   terminated by a number sign ("#") character or by the end
   of the URI.

      query       = *( pchar / "/" / "?" )

   The characters slash ("/") and question mark ("?") may
   represent data within the query component.  Beware that
   some older, erroneous implementations may not handle such
   data correctly when it is used as the base URI for relative
   references (Section 5.1), apparently because they fail to
   distinguish query data from path data when looking for
   hierarchical separators.  However, as query components are
   often used to carry identifying information in the form of
   "key=value" pairs and one frequently used value is a
   reference to another URI, it is sometimes better for
   usability to avoid percent-encoding those characters.

-- while the RFC it obsoleted shows --

   [8] <http://www.rfc.net/rfc2396.html#s3.4.>

   3.4. Query Component

   The query component is a string of information to be
   interpreted by the resource.

      query         = *uric

   Within a query component, the characters ";", "/", "?",
   ":", "@", "&", "=", "+", ",", and "$" are reserved.

Appendix D *does* say --

   Section 2, on characters, has been rewritten to explain
   what characters are reserved, when they are reserved,
   and why they are reserved, even when they are not used
   as delimiters by the generic syntax.  The mark characters
   that are typically unsafe to decode, including the
   exclamation mark ("!"), asterisk ("*"), single-quote ("'"),
   and open and close parentheses ("(" and ")"), have been
   moved to the reserved set in order to clarify the distinction
   between reserved and unreserved and, hopefully, to answer
   the most common question of scheme designers.  Likewise, the
   section on percent-encoded characters has been rewritten,
   and URI normalizers are now given license to decode any
   percent-encoded octets corresponding to unreserved characters.
   In general, the terms "escaped" and "unescaped" have been
   replaced with "percent-encoded" and "decoded", respectively,
   to reduce confusion with other forms of escape mechanisms.

-- but there's no discussion of the substantial changes to
Section 3.4, which I flagged above...

So, while it should be clear that the ampersand has historically
been reserved in *part* of the HTTP URI, it seems that this is no
longer true.  Older implementations, and authors who learned from
the older RFC, may well still treat it as reserved -- and may well
percent-encode it in components of the URI other than the Query
Component, for a variety of reasons.  Not least of these is simple
confusion as to when percent-encoding is required and when it is
not.  This aspect of the spec remains rather unclear to the average
reader, thanks to the utter lack of examples covering such tricky
scenarios as an ampersand in the "path-absolute".

Maybe we need some URI validation tools, to start with...

Be seeing you,

Ted

-- 
A: Yes.                      http://www.guckes.net/faq/attribution.html
| Q: Are you sure?
| | A: Because it reverses the logical flow of conversation.
| | | Q: Why is top posting frowned upon?

Ted Thibodeau, Jr.           //               voice +1-781-273-0900 x32
Evangelism & Support         //        mailto:[EMAIL PROTECTED]
OpenLink Software, Inc.      //              http://www.openlinksw.com/
                                 http://www.openlinksw.com/weblogs/uda/
OpenLink Blogs              http://www.openlinksw.com/weblogs/virtuoso/
                               http://www.openlinksw.com/blog/~kidehen/
    Universal Data Access and Virtual Database Technology Providers
