* Daniel Burrows <[EMAIL PROTECTED]> [2007-01-26 21:04]:
>   Ick.  Is that even a legal URL?

I have no idea, but it works when I paste it into Firefox.

>   One idea would be to use an exclusive, rather than inclusive, regexp
> to find URLs.  Consider a URL to be anything starting with "http" and
> terminated by whitespace or a few special characters (say, ["',.?>]).
> This seems more failure-prone, though, as I can't possibly predict every
> convention people use to terminate URLs (e.g., what about » or ¿; I'm
> sure there are more I don't know).

Yes, I agree that it'll be hard to get it right all the time.  I'd
personally assume that ' is part of the URL but that, for example, »
isn't; I might be wrong, though, and any rule you put into urlscan
will get it wrong in some cases. :/
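Just to make the trade-off concrete, here's a rough sketch of the
"exclusive" approach (not urlscan's actual code): grab everything from
"http" up to whitespace, then strip a guessed set of trailing
punctuation.  The terminator set below is only an example; as you say,
no fixed set will be right every time.

```python
import re

# Characters assumed to be sentence punctuation rather than part of a
# URL when they appear at the end of a match (a guess, per the thread).
TRAILING = '"\',.?>»'

def find_urls(text):
    urls = []
    # Exclusive rule: a URL candidate is "http" followed by any run of
    # non-whitespace characters.
    for match in re.finditer(r'http\S+', text):
        # Strip punctuation that probably belongs to the sentence.
        urls.append(match.group().rstrip(TRAILING))
    return urls
```

Note that stripping only *trailing* characters keeps dots and commas
inside the URL intact, so http://www.cyrius.com/ survives, while a
comma after it is dropped.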

>   Another option would be to add a command to "lengthen" a match, telling
> urlscan to update the currently selected match with the immediately
> following character (or maybe the next character & everything else that
> looks like part of a URL).  This might be the best solution, since weird
> URLs like that seem like an oddity, and urlscan will probably make
> inevitable errors in other situations anyway.

Something like this would probably be best.
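For the sake of discussion, the "lengthen" command could be as simple
as the sketch below: extend the current match by one character, then
absorb any following run of non-whitespace as a rough proxy for "looks
like part of a URL".  The function name and interface are made up, not
urlscan's API.

```python
def lengthen(text, end):
    """Return a new end offset for a match currently ending at `end`:
    absorb the immediately following character, plus any run of
    non-whitespace after it."""
    if end >= len(text):
        return end
    end += 1                      # the immediately following character
    while end < len(text) and not text[end].isspace():
        end += 1                  # ...and everything URL-ish after it
    return end
```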

-- 
Martin Michlmayr
http://www.cyrius.com/
