Looking at Parser.flex...

/* Non whitespace and not close of tag (right angle bracket). I.e.
 * chars that would not cause an unquoted attribute to end */
NONSEP=[^>\n\r\ \t\b\012:?]
NONSEP_NOQUOTE=[^>\n\r\ \t\b\012:?"]

This I don't understand... "?" and ":" do not terminate the attribute (meaning the URL in an a href=<unquoted URL>), yet both are excluded from NONSEP. Presumably this is to reduce backtracking? Anyway, the proposed modifications are:

NONSEP=[^>\n\r\ \t\b\012:]
NONSEP_NOQUOTE=[^>\n\r\ \t\b\012:"]
......
/* Catch any colon or ?htl= within the URL */
LINK_PATTERNS1={LINK_ATTRS}{WS}={WS}["][^":]*[:][^"]*
LINK_PATTERNS2={LINK_ATTRS}{WS}={WS}({NONSEP_NOQUOTE}{NONSEP}*)?[:]{NONSEP}*
LINK_PATTERNS3={LINK_ATTRS}{WS}={WS}["][^"?]*[?]htl=
LINK_PATTERNS4={LINK_ATTRS}{WS}={WS}({NONSEP_NOQUOTE}{NONSEP}*)?[?]htl=
LINK_PATTERNS={LINK_PATTERNS1}|{LINK_PATTERNS2}|{LINK_PATTERNS3}|{LINK_PATTERNS4}

This should achieve the functionality we want: block all colons (if we want to change the port, we should encode it as __CHECKED_HTTP_hostname_port__ or something), and allow ? unless it is part of a ?htl=...

However, I could be grossly mistaken. Comments?
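
One rough way to sanity-check the intended behaviour outside the lexer is to transcribe the four proposed patterns into Python's re module and try a few attribute strings. This is only a sketch: LINK_ATTRS and WS are not shown in the post, so the definitions below (href/src and optional whitespace) are assumptions, the literal "?" is written as an escaped \? so that it is not read as the optional operator, and is_blocked is a hypothetical helper name. Python's backtracking matcher is not the same engine as the generated scanner, so this checks the intent of the patterns, not the lexer itself.

```python
import re

# Assumptions for illustration only: the post does not define these macros.
LINK_ATTRS = r"(?:href|src)"
WS = r"[ \t\r\n]*"

# Proposed classes: ':' still ends the run, '?' no longer does.
# \x08 is backspace (\b in the flex class); \012 is '\n', already listed.
NONSEP = r"[^>\n\r \t\x08:]"
NONSEP_NOQUOTE = r'[^>\n\r \t\x08:"]'

PREFIX = LINK_ATTRS + WS + "=" + WS
UNQUOTED_HEAD = "(?:" + NONSEP_NOQUOTE + NONSEP + "*)?"

LINK_PATTERNS = re.compile("|".join([
    PREFIX + r'"[^":]*:[^"]*',                 # 1: quoted value containing a colon
    PREFIX + UNQUOTED_HEAD + ":" + NONSEP + "*",  # 2: unquoted value containing a colon
    PREFIX + r'"[^"?]*\?htl=',                 # 3: quoted value containing ?htl=
    PREFIX + UNQUOTED_HEAD + r"\?htl=",        # 4: unquoted value containing ?htl=
]))


def is_blocked(attr: str) -> bool:
    """Return True if the attribute text matches one of the four link patterns."""
    return LINK_PATTERNS.match(attr) is not None


if __name__ == "__main__":
    for sample in ['href="http://example.com/"', "href=mailto:someone",
                   'href="key?htl=10"', "href=/key?htl=10",
                   'href="page.html?x=1"', "href=page.html?x=1"]:
        print(sample, is_blocked(sample))
```

With these assumed definitions, any value containing a colon or ?htl= is caught, while an ordinary query string such as page.html?x=1 passes through, which is the behaviour described above.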