Looking at Parser.flex...

/* Non whitespace and not close of tag (right angle bracket).  I.e.
 * chars that would not cause an unquoted attribute to end */
NONSEP=[^>\n\r\ \t\b\012:?]
NONSEP_NOQUOTE=[^>\n\r\ \t\b\012:?"]

This I don't understand: neither "?" nor ":" terminates the attribute
(meaning the URL in an a href=<unquoted URL>). Presumably they are
excluded to reduce backtracking? Anyway, the proposed modifications are:

NONSEP=[^>\n\r\ \t\b\012:]
NONSEP_NOQUOTE=[^>\n\r\ \t\b\012:"]

......

/* Catch any colon or ?htl= within the URL */
LINK_PATTERNS1={LINK_ATTRS}{WS}={WS}["][^":]*[:][^"]*
LINK_PATTERNS2={LINK_ATTRS}{WS}={WS}({NONSEP_NOQUOTE}{NONSEP}*)?[:]{NONSEP}*
LINK_PATTERNS3={LINK_ATTRS}{WS}={WS}["][^"?]*[?]htl=
LINK_PATTERNS4={LINK_ATTRS}{WS}={WS}({NONSEP_NOQUOTE}{NONSEP}*)?[?]htl=
LINK_PATTERNS={LINK_PATTERNS1}|{LINK_PATTERNS2}|{LINK_PATTERNS3}|{LINK_PATTERNS4}

This should achieve the functionality we want: block all colons (if we
want to change the port, we should encode it as
__CHECKED_HTTP_hostname_port__ or something), and allow ? unless it is
the start of a ?htl=... parameter. However, I could be grossly mistaken.
Comments?
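
For anyone who wants to poke at the intended behaviour without rebuilding
the lexer, here is a rough Python translation of the proposed macros. It is
only a sketch: the definitions of {WS} and {LINK_ATTRS} are not quoted in
this message, so the ones below are assumptions, is_blocked is a
hypothetical helper name, and Python's re module is a backtracking matcher
rather than a JFlex longest-match scanner. The "?" in front of "htl=" is
read as a literal question mark, as the comment above the patterns says.

```python
import re

# Assumed stand-ins: the real {WS} and {LINK_ATTRS} macros are not shown
# in the message, so these are illustrative guesses only.
WS = r"[ \t\r\n]*"
LINK_ATTRS = r"(?:href|src)"

# Proposed character classes: "?" is no longer excluded, ":" still is.
# Inside a Python character class, \b means backspace, as in the original.
NONSEP = r"[^>\n\r \t\b:]"
NONSEP_NOQUOTE = r'[^>\n\r \t\b:"]'

PREFIX = LINK_ATTRS + WS + "=" + WS
UNQUOTED_HEAD = "(?:" + NONSEP_NOQUOTE + NONSEP + "*)?"

PATTERNS = [
    PREFIX + r'"[^":]*:[^"]*',              # 1: colon in a quoted value
    PREFIX + UNQUOTED_HEAD + ":" + NONSEP + "*",  # 2: colon, unquoted
    PREFIX + r'"[^"?]*\?htl=',              # 3: ?htl= in a quoted value
    PREFIX + UNQUOTED_HEAD + r"\?htl=",     # 4: ?htl= in an unquoted value
]
LINK_PATTERNS = re.compile("|".join("(?:" + p + ")" for p in PATTERNS))


def is_blocked(attr_text):
    """Return True if attr_text starts with something LINK_PATTERNS catches."""
    return LINK_PATTERNS.match(attr_text) is not None


if __name__ == "__main__":
    for sample in ['href="http://example.com/"', "href=mailto:foo",
                   'href="page.html?x=1"', "href=page.html?htl=5"]:
        print(sample, "->", "blocked" if is_blocked(sample) else "allowed")
```

Under these assumptions, any value containing a colon or starting a
"?htl=" parameter is caught, while an ordinary query string such as
page.html?x=1 passes through in both the quoted and unquoted forms.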
