mod_perl guide corrections: & in uris

Marc Lehmann Sun, 11 Feb 2001 17:40:19 -0800
Stas told me to forward my mail to the list, since there was a large
discussion about it. Since I now see that this seems to have been a kind
of dispute and not an omission, I'll provide references to the standards
below.

----- Forwarded message from Marc Lehmann <[EMAIL PROTECTED]> -----

Subject: mod_perl guide corrections
From: Marc Lehmann <[EMAIL PROTECTED]>
Date: Sun, 11 Feb 2001 20:24:59 +0100
To: [EMAIL PROTECTED]

in http://perl.apache.org/guide/browserbugs.html I read:

   Preventing QUERY_STRING from getting corrupted because of &entity key
   names:

   http://my.site.com/foo.pl?foo=bar&reg=foobar, then some browsers will
   interpret &reg as an SGML entity

The guide claims this is a browser bug, which it isn't. Browsers are
perfectly entitled to interpret the &reg as an entity when you embed this
URL in the HTML source unquoted. What's wrong is feeding non-HTML code to the
browser in the first place. But as we all know browsers always try to
decipher html-like syntax even if it is incorrect. In the above case the
browser might "fix" the broken html fragment by assuming "&reg=" is an
entity. Other browsers might interpret it differently. Still others might
just view the page as text since it isn't html.
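
To illustrate the kind of "fix" a lenient parser might apply, here is a
minimal Python sketch. It uses html.unescape from the standard library as a
stand-in for a browser, which is an assumption: html.unescape follows HTML5
text-decoding rules, and real browsers (then and now) may behave differently.

```python
import html

# URL from the guide, pasted into HTML without escaping the ampersand.
raw_url = "http://my.site.com/foo.pl?foo=bar&reg=foobar"

# html.unescape accepts a few legacy entity names (such as "reg") even
# without a trailing semicolon, so "&reg" is decoded as an entity here.
decoded = html.unescape(raw_url)

print(decoded)
print(decoded == raw_url)  # False if the URL was altered by decoding
```

Under these rules the "reg" parameter no longer survives decoding, because
"&reg" is replaced by the registered-sign character.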

Saying this is a browser bug will only encourage people to generate such
broken URLs, which will always be a problem with browsers adhering to the
standard ;)

So it would be much better to educate people to actually generate correct
html (by quoting & as &amp; for example).
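
As a rough sketch of what generating correct HTML can look like, the Python
snippet below escapes the URL before placing it in an href attribute. The
helper name href_link is hypothetical and only serves the example; any
templating layer that escapes attribute values would do the same job.

```python
import html

def href_link(url: str, text: str) -> str:
    """Return an <a> element with url and text escaped for HTML."""
    # html.escape turns & into &amp; (and, with quote=True, " into &quot;).
    safe_url = html.escape(url, quote=True)
    safe_text = html.escape(text)
    return '<a href="{}">{}</a>'.format(safe_url, safe_text)

link = href_link("http://my.site.com/foo.pl?foo=bar&reg=foobar", "foo")
print(link)
```

The URL itself is unchanged; only its representation inside the HTML source
uses &amp;.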

(This is, btw, the #1 php bug on the web despite the php manual explicitly
warning about this case ;) Interestingly, one rarely sees this bug in perl
code, although the mod_perl guide implicitly says this would be correct
code ;->)

----- End forwarded message -----

Now the rationale. Who defines HTML? What is standard HTML? First of all,
there is no HTML standard. The best thing that comes close is the W3C HTML
Recommendation. Since the W3C and nobody else defines HTML I argue that the
W3C HTML recommendations are the most important definition of HTML.

The current HTML version is XHTML1.0 (see http://www.w3.org/TR/html/). No
XML parser will parse the above fragment, as it is clearly incorrect (see
XML definition at http://www.w3.org/TR/REC-xml).
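
The XML claim can be checked with any conforming parser. The sketch below
uses Python's xml.etree.ElementTree (backed by expat); the two fragments and
the helper is_well_formed are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The same link, once with a bare ampersand and once with &amp;.
broken = '<a href="http://my.site.com/foo.pl?foo=bar&reg=foobar">foo</a>'
fixed = '<a href="http://my.site.com/foo.pl?foo=bar&amp;reg=foobar">foo</a>'

def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False on a parse error."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(broken))
print(is_well_formed(fixed))

# After parsing, the attribute value holds the plain URL again.
href = ET.fromstring(fixed).get("href")
print(href)
```

The unescaped fragment is rejected because "&reg" is not followed by a
semicolon, so it is not a valid entity reference.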

Since the de-facto HTML version in use is HTML4.01, however, I will also
give reasons why it is also incorrect HTML4.01 (which is an application
of SGML). First of all, in most SGML applications &reg would indeed be a
valid entity reference and, if not defined, would generate a parse error.

In HTML, "&" is an active character like "<". Thinking that a browser
must somehow "guess" at whether it is used as entity start or not is like
requesting that a browser must also guess that "<p=neu</p" might or might
not contain a p element or that "<xx>" is not a valid HTML element and
should therefore be displayed as text.

In 5.3.2 Character entity references
(http://www.w3.org/TR/html4/charset.html) it is written:

   Authors wishing to put the "<" character in text should use "&lt;" (ASCII
   decimal 60) to avoid possible confusion with the beginning of a tag (start
   tag open delimiter).

And they apply the same advice to "&":
   
   Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid
   confusion with the beginning of a character reference (entity reference
   open delimiter). Authors should also use "&amp;" in attribute values since
   character references are allowed within CDATA attribute values.

These parts of HTML4 use an informal description of HTML (e.g. it
doesn't describe the SGML comment syntax fully). The HTML declaration
(http://www.w3.org/TR/html4/HTML4.decl) makes the role of & explicit, but
as I've written, formal SGML doesn't help at all.

Finally, appendix B of the HTML4 standard says:

   Although URIs do not contain non-ASCII values (see [URI], section 2.1)
   authors sometimes specify them in attribute values expecting URIs (i.e.,
   defined with %URI; in the DTD). For instance, the following href value is
   illegal: <A href="http://foo.org/Håkon">...</A>

OK, wrong excerpt, but still interesting. Here is the right excerpt:

   B.2.2 Ampersands in URI attribute values

   The URI that is constructed when a form is submitted may be used
   as an anchor-style link (e.g., the href attribute for the A
   element).  Unfortunately, the use of the "&" character to separate
   form fields interacts with its use in SGML attribute values to
   delimit character entity references. For example, to use the URI
   "http://host/?x=1&y=2" as a linking URI, it must be written <A
   href="http://host/?x=1&amp;y=2">
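
The round trip described in B.2.2 can be sketched in Python as follows,
assuming the standard-library HTMLParser as the consumer of the markup. The
class name HrefCollector is hypothetical; the markup is the example from the
excerpt.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit

class HrefCollector(HTMLParser):
    """Collect the decoded href values of all <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attribute values arrive with
        # character references such as &amp; already decoded.
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v is not None)

collector = HrefCollector()
collector.feed('<A href="http://host/?x=1&amp;y=2">link</A>')
collector.close()

href = collector.hrefs[0]
params = parse_qs(urlsplit(href).query)
print(href)
print(params)
```

The escaped form in the source yields the intended URI, and both form fields
come back out of the query string.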

As a summary:

strict SGML interpretation: & starts an entity and must be escaped
strict HTML interpretation: & starts an entity and must be escaped
loose HTML interpretation: & should be escaped
de-facto browser interpretation: & must be escaped to guarantee correctness
XML interpretation: & starts an entity and must otherwise be escaped

If one argues that the W3C does not define HTML and that there is no HTML
"standard" then this list shortens to a compatibility problem. But in
that case I have to ask: if W3C doesn't define HTML, who does? microsoft?
netscape? ;)

I hope I settled this at least for the mod_perl guide ;)

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       [EMAIL PROTECTED]      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |