On Thursday 08 March 2001 09:19, you wrote:
> I'm putting together a regex to pull all of the urls out of a web page.
> Not the href tag, but just the url part of that tag.
>
> Here's what I've come up with:
>
> preg_match_all('/<.*href\s*=\s*(\"|\')?(.*?)(\s|\"|\'|>)/i', $html,
> $matches);
> foreach($matches[2] as $m) print "<P>$m\n";
>
> All regex masters please tell me if I'm missing something. It's working
> well, but I'm still learning about perl regex and I'd like any input if
> at all possible.

Pretty good. Some minor things:
(1) "<.*href" will also match <a name='foo'><h2> example of 
href="hello"</h2>, because ".*" happily runs across tag boundaries and 
picks up an "href" that sits in plain text.

(2) You're pretty lax on the tag syntax: you don't require the opening 
and closing quotes to match, and you don't require quotes at all.
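
Point (1) is easy to reproduce. This is only a small sketch: the pattern 
is the one from your mail, the input is the sample string from (1), and 
the variable names are mine.

```php
<?php
// Pattern from the original question, unchanged.
$pattern = '/<.*href\s*=\s*(\"|\')?(.*?)(\s|\"|\'|>)/i';

// Sample from point (1): "href" appears in the text, not inside a tag.
$html = '<a name=\'foo\'><h2> example of href="hello"</h2>';

preg_match_all($pattern, $html, $matches);

// Group 2 holds whatever followed "href=".
$found = $matches[2];
foreach ($found as $m) {
    print "<P>$m\n";
}
```

Here "hello" is reported as a URL even though no tag in the input has an 
href attribute.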

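I'd rewrite it to
'/<\s*a\s+[^>]*?href\s*=\s*("|\')(.*?)\\1[^>]*>/i'
Here "\s+" requires whitespace after the tag name, "[^>]*" allows other 
attributes around href, and "\\1" forces the closing quote to match the 
opening one. The URL ends up in $matches[2].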
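
A rough sketch of how a stricter pattern like this behaves with 
preg_match_all. The pattern used here allows other attributes around 
href, and the sample $html is made up for illustration, so treat it as a 
starting point rather than a finished solution.

```php
<?php
// Hypothetical sample input; any HTML string would do.
$html = '<a name="foo"><h2>example of href="hello"</h2>'
      . '<a class="x" href="http://www.php.net/">PHP</a>'
      . "<A HREF='ftp://example.org/pub/'>files</A>";

// Group 1 captures the opening quote, \1 requires the same quote to
// close the value, and group 2 is the URL itself. [^>] keeps the match
// inside a single tag.
$pattern = '/<\s*a\s+[^>]*?href\s*=\s*("|\')(.*?)\\1[^>]*>/i';

preg_match_all($pattern, $html, $matches);

$urls = $matches[2];
foreach ($urls as $url) {
    print "<P>$url\n";
}
```

The href="hello" in the heading text is skipped, and both real links are 
found regardless of quote style or case.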

> What's a good way to exclude things like javascript: urls and other non
> URI info? I guess what I'm really looking for is all the http urls, no
> ftp, mms etc... or anything like that.

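Use the following regex fragments on your result to get only the types 
you want (wrap the one you need in "/" delimiters, and anchor it with ^ 
and $ if the whole string has to be a URL):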

        $HostName = '([a-zA-Z][\w-]*(\.[a-zA-Z][\w-]*)+)';
        $HostIP   = '(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})';
        $Host     = '(' . $HostName . '|' . $HostIP . ')';
        $HTTPPath = '(([\w\.\/~\?%=\-]|&amp;)+)';
        $FTPPath  = '(\/[^\/\s]*)*\/?';
        $Port     = '(:\d+)?';
        $HTTPURL    = 'http:\/\/' . $Host . $Port . $HTTPPath;
        $WWWAddress = 'www\.' . $HostName . $Port . $HTTPPath;
        $FTPURL     = 'ftp:\/\/' . $Host . $FTPPath;
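
To keep only the http URLs, one possible way is to turn $HTTPURL into a 
complete pattern and run it over the extracted list with preg_grep. The 
delimiters, the ^...$ anchoring, the "i" flag, the $httpOnly name and the 
sample $urls are my assumptions here, not part of the fragments above.

```php
<?php
// Fragments copied from the message above (FTP parts left out).
$HostName = '([a-zA-Z][\w-]*(\.[a-zA-Z][\w-]*)+)';
$HostIP   = '(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})';
$Host     = '(' . $HostName . '|' . $HostIP . ')';
$HTTPPath = '(([\w\.\/~\?%=\-]|&amp;)+)';
$Port     = '(:\d+)?';
$HTTPURL  = 'http:\/\/' . $Host . $Port . $HTTPPath;

// Assumption: "/" delimiters, anchored at both ends, case-insensitive.
$httpOnly = '/^' . $HTTPURL . '$/i';

// Hypothetical list of values pulled out of href attributes.
$urls = [
    'http://www.php.net/manual/en/',
    'javascript:void(0)',
    'ftp://example.org/pub/',
    'http://example.org:8080/index.php?id=3&amp;x=1',
    'mms://example.org/stream',
];

// preg_grep keeps the entries that match; array_values reindexes them.
$httpUrls = array_values(preg_grep($httpOnly, $urls));

foreach ($httpUrls as $url) {
    print "<P>$url\n";
}
```

The javascript:, ftp: and mms: entries are dropped, and the two http 
URLs remain.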

-- 
Christian Reiniger
LGDC Webmaster (http://sunsite.dk/lgdc/)

Very funny, Scotty! Now beam up my clothes...

--
PHP General Mailing List (http://www.php.net/)