i took a hard look at the w3c html 4.01 spec.  according to it, a uri is
considered cdata within html.  cdata can contain entities unless it
is the content of a script or style element.  thus a uri in an href may
contain entities.  in fact, the spec recommends that
        <a href="http://example.com?x=1&y=2">
be encoded as
        <a href="http://example.com?x=1&amp;y=2">
(http://www.w3.org/TR/html401/appendix/notes.html#non-ascii-chars §
B.2.2.)  this isn't what i've seen in practice, though.

i took the approach of limiting the characters in entity references to
exactly what occurs in the w3c's list, "[A-Za-z0-9]+", of allowing
any character not in that set to terminate the entity (as
http://www.w3.org/TR/html401/charset.html#entities indicates we
should), and finally of not trying to prefix-match entities.  the comment
at the top of the file says that the prefix-matching was done for
buggy html.  i wish i knew more about the problem that comment
refers to.
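
to make the rule concrete, here is a minimal standalone sketch of the
scan described above.  it is not the libhtml code: enttab, isentchar
and scanent are hypothetical names, the table holds only four
entities, and numeric references (&#...;) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* hypothetical stand-in for libhtml's chartab; only a few entries */
struct ent {
	const char *name;
	int value;
};

static const struct ent enttab[] = {
	{"amp", '&'}, {"gt", '>'}, {"lt", '<'}, {"quot", '"'},
};

/* ascii-only test for [A-Za-z0-9]; anything else ends the name */
static int
isentchar(int c)
{
	return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
		(c >= '0' && c <= '9');
}

/*
 * s points just past the '&'.  on an exact match, return the
 * entity's value and set *len to the bytes consumed (the name
 * plus an optional ';').  otherwise return -1 with *len = 0.
 * no prefix matching is attempted.
 */
int
scanent(const char *s, size_t *len)
{
	size_t i, k;

	assert(s != NULL && len != NULL);
	*len = 0;
	for(k = 0; isentchar((unsigned char)s[k]); k++)
		;
	if(k == 0)
		return -1;
	for(i = 0; i < sizeof enttab / sizeof enttab[0]; i++){
		if(strlen(enttab[i].name) == k &&
		   memcmp(enttab[i].name, s, k) == 0){
			*len = k + (s[k] == ';');
			return enttab[i].value;
		}
	}
	return -1;
}
```

with this rule, the "&y=2" in the unencoded href above is left alone,
since "y" is not an entity name, while "&amp;y=2" decodes to "&y=2".
something like "&ampx;" is also left alone, which is where dropping
the prefix match changes behaviour.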

if this seems a fruitful way to go, i'll submit a patch.

- erik

; diff -c /n/sources/plan9/sys/src/libhtml/lex.c lex.c
/n/sources/plan9/sys/src/libhtml/lex.c:1245,1251 - lex.c:1245,1251
                        c = getchar(ts);
                        if(c < 0)
                                break;
-                       if(ISNAMCHAR(c)) {
+                       if(c < 256 && (isalpha(c) || isdigit(c))) {
                                if(k < SMALLBUFSIZE-1)
                                        buf[k++] = c;
                        }
/n/sources/plan9/sys/src/libhtml/lex.c:1255,1263 - lex.c:1255,1263
                                break;
                        }
                }
-               if(c >= 0) {
+               if(c >= 256 || !(isalpha(c) || isdigit(c))) {
                        fnd = _lookup(chartab, NCHARTAB, buf, k, &ans);
-                       if(!fnd) {
+                       if(0 && !fnd) {
                                // Try prefixes of s
                                if(c == ';' || c == '\n' || c == '\r')
                                        ungetchar(ts, c);

On Sat Jul  8 08:55:26 CDT 2006, [EMAIL PROTECTED] wrote:

> I told you  it's in the NOTES file.
> 
> lotte% tail -1 /usr/fgb/src/abaco/NOTES 
>          /sys/src/libhtml/lex.c:1258  if(c == ';' ) {
