eric, I don't know if you noticed it, but your diff doesn't solve
the problem.
--- Begin Message ---
i took a hard look at the w3c html 4.01 specs. according to them, a
uri is considered CDATA within html. CDATA can contain entities unless
it is the content of a script or style element. thus a uri in an href
may contain entities.
in fact, they recommend that
<a href="http://example.com?x=1&y=2">
be encoded as
<a href="http://example.com?x=1&amp;y=2">
(http://www.w3.org/TR/html401/appendix/notes.html#non-ascii-chars §
B.2.2.) this isn't what i've seen in practice, though.
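for illustration, a minimal sketch of the encoding the spec recommends. escapeamp is a hypothetical helper, not anything in libhtml; it blindly rewrites every '&', so it assumes its input is a raw uri that has not already been escaped.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of libhtml): copy uri into dst,
 * writing each '&' as "&amp;" so the result can be used as an
 * attribute value. n is the size of dst in bytes.
 * Returns the length of the escaped string, not counting the NUL,
 * or (size_t)-1 if dst is too small.
 */
size_t
escapeamp(const char *uri, char *dst, size_t n)
{
	size_t k;

	k = 0;
	for(; *uri != '\0'; uri++){
		if(*uri == '&'){
			/* need 5 bytes for "&amp;" plus room for the NUL */
			if(k + 5 >= n)
				return (size_t)-1;
			memcpy(dst + k, "&amp;", 5);
			k += 5;
		}else{
			if(k + 1 >= n)
				return (size_t)-1;
			dst[k++] = *uri;
		}
	}
	if(k >= n)
		return (size_t)-1;
	dst[k] = '\0';
	return k;
}
```

a lexer that decodes entities in attribute values turns the escaped form back into the original uri, which is why the lexer has to get "&amp;" right even inside an href.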
i took the approach of limiting the characters in references to
exactly what occurs in the w3c's list: "[A-Za-z0-9]+", of allowing
any character not in that set to terminate the entity (as
http://www.w3.org/TR/html401/charset.html#entities indicates we
should), and finally of not trying to prefix-match entities. the
comment at the top of the file says that the prefix matching was done
for buggy html. i wish i knew more about the problem that comment
refers to.
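a minimal standalone sketch of that rule, for illustration only. isentchar, scanentname, entlookup and the five-entry table are hypothetical names made up here; this is not libhtml's code, and the real table is much larger.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table; a real one holds the full w3c entity list. */
struct Ent {
	const char *name;
	int value;
};

static const struct Ent enttab[] = {
	{"amp", 38}, {"copy", 169}, {"gt", 62}, {"lt", 60}, {"quot", 34},
};

/* Only ASCII letters and digits may appear in an entity name. */
static int
isentchar(int c)
{
	return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
		|| (c >= '0' && c <= '9');
}

/*
 * s points just past the '&'. Copy the leading run of [A-Za-z0-9]
 * into buf (at most n-1 bytes, NUL-terminated) and return the number
 * of input characters consumed. A ';' right after the name is
 * consumed too; any other terminator is left for the caller.
 */
size_t
scanentname(const char *s, char *buf, size_t n)
{
	size_t i, k;

	k = 0;
	for(i = 0; isentchar((unsigned char)s[i]); i++)
		if(k + 1 < n)
			buf[k++] = s[i];
	if(n > 0)
		buf[k] = '\0';
	if(s[i] == ';')
		i++;
	return i;
}

/* Exact match only: no attempt to match a prefix of name. */
int
entlookup(const char *name)
{
	size_t i;

	for(i = 0; i < sizeof enttab / sizeof enttab[0]; i++)
		if(strcmp(enttab[i].name, name) == 0)
			return enttab[i].value;
	return -1;
}
```

with this rule the "&y" in "x=1&y=2" scans as the name "y", the lookup fails, and the text can be passed through untouched. "&copyright" also fails, where prefix matching would have produced the copyright sign followed by "right".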
if this seems a fruitful way to go, i'll submit a patch.
- erik
; diff -c /n/sources/plan9/sys/src/libhtml/lex.c lex.c
/n/sources/plan9/sys/src/libhtml/lex.c:1245,1251 - lex.c:1245,1251
c = getchar(ts);
if(c < 0)
break;
- if(ISNAMCHAR(c)) {
+ if(c < 256 && (isalpha(c) || isdigit(c))) {
if(k < SMALLBUFSIZE-1)
buf[k++] = c;
}
/n/sources/plan9/sys/src/libhtml/lex.c:1255,1263 - lex.c:1255,1263
break;
}
}
- if(c >= 0) {
+ if(c >= 256 || !(isalpha(c) || isdigit(c))) {
fnd = _lookup(chartab, NCHARTAB, buf, k, &ans);
- if(!fnd) {
+ if(0 && !fnd) {
// Try prefixes of s
if(c == ';' || c == '\n' || c == '\r')
ungetchar(ts, c);
On Sat Jul 8 08:55:26 CDT 2006, [EMAIL PROTECTED] wrote:
> I told you it's in the NOTES file.
>
> lotte% tail -1 /usr/fgb/src/abaco/NOTES
> /sys/src/libhtml/lex.c:1258 if(c == ';' ) {
--- End Message ---