On Fri, 17 Aug 2001, Craig S. Cottingham wrote:
> 
> It goes on further to define "scheme" as matching the regex
> /^[a-z][a-z0-9+.-]+$/i.
> 
> So, it appears that in URIs and (as a subset) URLs, not only are 
> "mail" and "news" valid schemes, but so are an infinity of other 
> strings. Furthermore, without constraining the original problem 
> (find URLs in a string), it seems that a multitude of false 
> positives will be generated.

Perhaps not.  First of all, just how many words in typical English text
have a colon embedded in them?  This alone should get us pretty far:

m{([a-z][a-z0-9+.-]+:(?:[a-z0-9;/?:@&=+\$,_.!~*'()-]|%[a-f0-9]{2})+)}ig;
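As a rough sketch of how that pattern might be used to pull candidate URLs out of a string (the sample text and variable names are made up for illustration, not taken from the original problem):

```perl
use strict;
use warnings;

# Hypothetical sample text; the addresses in it are invented.
my $text = 'See http://example.com/a%20b?x=1 or mailto:someone@example.org for details.';

# The pattern from above, compiled with qr//.  "$" is written as "\$"
# so that "$," is not interpolated as Perl's output field separator.
my $url_re = qr{([a-z][a-z0-9+.-]+:(?:[a-z0-9;/?:@&=+\$,_.!~*'()-]|%[a-f0-9]{2})+)}i;

# In list context with /g, the match returns every captured candidate.
my @found = $text =~ /$url_re/g;
print "$_\n" for @found;
# http://example.com/a%20b?x=1
# mailto:someone@example.org
```

Since "." and "," are in the character class, punctuation that directly follows a URL in running text would be swallowed as well; whether that matters depends on the application.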

One can do better by applying scheme-specific restrictions, but for the
most commonly used schemes that doesn't help much -- http, for example,
restricts the authority section to <hostname>[:<port>] (the extended
[<user>[:<passwd>]@]<hostname>[:<port>] syntax is non-standard IIRC, but
should probably still be matched) but applies no restrictions at all to
the path or the optional query section.
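To illustrate what such a scheme-specific restriction could look like, here is a minimal sketch for http only. The hostname rule is deliberately simplified, accepting https as well is my own assumption, and $http_re is just a name picked for this example:

```perl
use strict;
use warnings;

my $http_re = qr{
    ( https?://                                # scheme, https added by assumption
      (?: [a-z0-9;:&=+\$,_.!~*'()%-]+ \@ )?    # optional userinfo part
      [a-z0-9] (?: [a-z0-9.-]* [a-z0-9] )?     # hostname, simplified
      (?: : [0-9]+ )?                          # optional port
      (?: [/?] [a-z0-9;/?:@&=+\$,_.!~*'()%-]* )?   # unrestricted path and query
    )
}xi;

# Invented sample text: one full http URL and one bare "http:foo".
my $text = 'Go to http://user:pw@www.example.com:8080/a?b=c now, not http:foo';

my @found = $text =~ /$http_re/g;
print "$_\n" for @found;
# http://user:pw@www.example.com:8080/a?b=c
```

The generic pattern would also have accepted "http:foo"; the restricted one rejects it because it insists on "//" followed by something that looks like a hostname.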

It's also debatable whether it really makes sense to check for % escape
correctness like I did above -- it probably gains very little over:

m{([a-z][a-z0-9+.-]+:[a-z0-9;/?:@&=+\$,_.!~*'()%-]+)}ig;

and in fact this may be more robust when dealing with URLs containing
non-standard escape codes.
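A small sketch of the difference, using a made-up URL with a non-standard %uXXXX-style escape ($strict and $loose are names chosen for this example only):

```perl
use strict;
use warnings;

# Invented URL containing an escape that is not "%" plus two hex digits.
my $text = 'http://example.com/caf%u00e9';

# Strict version: "%" must be followed by exactly two hex digits.
my $strict = qr{([a-z][a-z0-9+.-]+:(?:[a-z0-9;/?:@&=+\$,_.!~*'()-]|%[a-f0-9]{2})+)}i;
# Loose version: "%" is simply one more acceptable character.
my $loose  = qr{([a-z][a-z0-9+.-]+:[a-z0-9;/?:@&=+\$,_.!~*'()%-]+)}i;

my ($s) = $text =~ $strict;
my ($l) = $text =~ $loose;
print "strict: $s\n";   # strict: http://example.com/caf
print "loose:  $l\n";   # loose:  http://example.com/caf%u00e9
```

The strict pattern does not reject the URL; it silently truncates it at the odd escape, which is arguably worse than matching it whole.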

By the way, depending on the application, you probably want to add # to
the list of acceptable characters above to match fragment identifiers.
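For instance, the simpler pattern with # added to the class might look like this (again only a sketch, with an invented URL):

```perl
use strict;
use warnings;

# Invented sample text with a fragment identifier.
my $text = 'See http://example.com/doc.html#section2 for more.';

# Loose pattern from above with "#" added to the character class.
my $with_frag = qr{([a-z][a-z0-9+.-]+:[a-z0-9;/?:@&=+\$,_.!~*'()%#-]+)}i;

my ($url) = $text =~ $with_frag;
print "$url\n";   # http://example.com/doc.html#section2
```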

-- 
Ilmari Karonen - http://www.sci.fi/~iltzu/
"_Good_ sigmonster. Here, have a spammer."  -- Mike Andrews in the monastery
