John Von Essen [mailto:[EMAIL PROTECTED]] wrote:
> ## Email me for explaination of Regex
> if($_ =~ m/http:\/\/([\w\d]+(-+[\w\d]+)?\.)+[\w]{2,3}(\/.*)?/)
^^^^^
> This will only print out internet urls like:
>
> http://www.h-p.com.au/
> http://links.com/
> http://w.w.w.w.w.com/
>
> NOT intranet urls like:
>
> http://host/
Your regex is out of date.
It filters out valid internet URLs under the newer top-level domains with
more than three letters, such as *.info.
Furthermore, \w does not match only letters; it includes at least [a-zA-Z0-9_].
(It may include additional letters depending on your locale setting.)
Thus \d is a subset of \w, and [\w\d] is equivalent to plain \w.
You also don't check that 'http:' is at the beginning of the URL.
Thus "http://localhost/script?http://www.inter.net/foo/bar" would pass,
though it is an intranet URL.
I suggest the following regex:
m#^https?\://(\w+(-+\w+)?\.)+[a-z]{2,}(\:\d+)?(/|$)#i
(I've used # as the delimiter for readability.)
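
To see how the suggested regex behaves, here is a small test sketch. It
only exercises the pattern above against the URLs mentioned in this thread;
http://example.info/ is a made-up sample for the *.info case, and the
variable names are mine.

```perl
use strict;
use warnings;

# The suggested pattern, compiled once (# is the delimiter).
my $internet_url = qr#^https?\://(\w+(-+\w+)?\.)+[a-z]{2,}(\:\d+)?(/|$)#i;

# URLs from this thread, plus one hypothetical *.info example.
my @samples = (
    'http://www.h-p.com.au/',
    'http://links.com/',
    'http://w.w.w.w.w.com/',
    'http://example.info/',
    'http://host/',
    'http://localhost/script?http://www.inter.net/foo/bar',
    'http://127.0.0.0/',
);

# Print each URL with the verdict of the regex.
for my $url (@samples) {
    my $verdict = $url =~ $internet_url ? 'internet' : 'rejected';
    printf "%-55s %s\n", $url, $verdict;
}
```

The first four should be reported as internet URLs, the last three as
rejected.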
Notes:
- URLs with a '_' in the hostname pass, though they are invalid IIRC.
- URLs with raw IP numbers like http://127.0.0.0/ are rejected.
It is probably best to check them with a second regex and then handle them
depending on the IP number.
- it only covers http and https URLs.
Is it safe to replace "^https?" with "^[a-z]+"?
Of course some internet URLs would still be rejected, e.g. "mailto:".
But could some intranet URLs match the regex?
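
For the raw IP numbers mentioned in the notes, the second check could look
roughly like this. It is only a sketch: ip_octets and is_private_ip are
names I made up, and treating loopback (127.x) plus the RFC 1918 private
ranges as "intranet" is my assumption, not something settled in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the four octets if the host part of an
# http/https URL is a dotted-quad IP number, or nothing otherwise.
sub ip_octets {
    my ($url) = @_;
    return unless $url =~ m#^https?://(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})(:\d+)?(/|$)#i;
    my @octets = ($1, $2, $3, $4);
    return if grep { $_ > 255 } @octets;    # each octet must be 0..255
    return @octets;
}

# Assumption: loopback and RFC 1918 private ranges count as intranet.
sub is_private_ip {
    my @o = @_;
    return 1 if $o[0] == 10 || $o[0] == 127;
    return 1 if $o[0] == 172 && $o[1] >= 16 && $o[1] <= 31;
    return 1 if $o[0] == 192 && $o[1] == 168;
    return 0;
}

# Usage: classify a URL that the hostname regex rejected.
my $url = 'http://127.0.0.0/';
if (my @octets = ip_octets($url)) {
    print "$url is ", (is_private_ip(@octets) ? 'intranet' : 'internet'), "\n";
}
```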
Ciao, Claus