> -----Original Message-----
> From: farn...@googlemail.com [mailto:farn...@googlemail.com] On Behalf
> Of Edmund Hertle
> Sent: Thursday, January 15, 2009 4:13 PM
> To: PHP - General
> Subject: [PHP] Parsing HTML href-Attribute
> 
> Hey,
> I want to "parse" an href attribute in a given string to check whether it
> is a relative link and, if so, add an absolute path.
> Example:
> $string  = '<a class="sample" [...additional attributes...]
> href="/foo/bar.php" >';
> 
> I tried using regular expressions but my knowledge of RegEx is very limited.
> Things to consider:
> - $string could be quite long, but my only concern is the href attributes
>   (so working with explode() would not be very handy)
> - Should also work if href= uses single quotes or no quotes at all
> - The link could already be an absolute path, so just searching for href=
>   and then inserting the absolute path could mess up the link
> 
> Any ideas? Or can someone create a RegEx to use?

Just spitballing here, but this is probably how I would start:

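RegEx pattern: /<a.*? href=(.+?)>/i

(PHP's preg functions have no "g" modifier; use preg_match_all() if you need 
every match rather than just the first.)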
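
As a rough illustration of what that capture group ends up holding, here is 
the pattern run over three made-up anchor tags with preg_match_all(). The 
sample HTML and the variable names are assumptions for the demo, not anything 
from the original question:

```php
<?php

# hypothetical sample input covering double, single and no quotes
$html = '<a class="x" href="/foo/bar.php" >one</a> '
      . "<a href='baz.php'>two</a> "
      . '<a href=qux.php id=q>three</a>';

# preg_match_all() collects every match; only the "i" modifier is used
$matches = array();
$count = preg_match_all('#<a.*? href=(.+?)>#i', $html, $matches);

# $matches[1] holds the raw capture for each anchor, which still includes
# any quotes and anything else that sits between href= and the closing ">"
foreach($matches[1] as $raw)
{
        echo '[' . $raw . "]\n";
}
```

The raw captures are not clean URLs yet, which is why the next step has to 
look at the quoting.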

Then, using the capture group, determine whether the href value is quoted 
(single or double, doesn't matter). If it is, the URL is everything between 
the opening quote and the matching closing quote, so you don't need to worry 
about whitespace. If it isn't, then you must assume the first whitespace is 
the end of the URL and the beginning of additional attributes, and just grab 
the URL up to (but not including) the first whitespace.

So...

<?php

# $anchorText holds the text of the <a> tag (sample value for illustration)
$anchorText = '<a class="sample" href="foo/bar.php" >';
# $curDir holds the current directory (sample value for illustration)
$curDir = '/some/dir';

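# find the href attribute (PCRE in PHP has no "g" modifier)
$matches = Array();
if(!preg_match('#<a.*? href=(.+?)>#i', $anchorText, $matches))
{
        exit("No href attribute found\n");
}

# from here on, work with the captured value rather than the whole tag
$anchorText = trim($matches[1]);

# determine if it has surrounding quotes
if($anchorText[0] == '\'' || $anchorText[0] == '"')
{
        # pull everything between the opening quote and its matching close
        $endPos = strpos($anchorText, $anchorText[0], 1);
        if($endPos === false)
                $anchorText = substr($anchorText, 1);
        else
                $anchorText = substr($anchorText, 1, $endPos - 1);
}
else
{
        # pull up to the first space (if there is one)
        $spacePos = strpos($anchorText, ' ');
        if($spacePos !== false)
                $anchorText = substr($anchorText, 0, $spacePos);
}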

# now, check to see if it is relative or absolute
# (regex pattern searches for protocol spec (i.e., http://), which will be
# treated as an absolute path for the purpose of this algorithm)
if($anchorText[0] != '/' && preg_match('#^\w+://#', $anchorText) == 0)
{
        # add current directory to the beginning of the relative path
        # (nothing is done to absolute paths or URLs with protocol spec)
        $anchorText = $curDir . '/' . $anchorText;
}

echo $anchorText;

?>

...UNTESTED.
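
Since the original string may be long and contain many links, the same idea 
could also be applied to every anchor in one pass with 
preg_replace_callback(). This is only a sketch under a few assumptions: the 
function name prefixRelativeHrefs() and the sample input are made up, the 
pattern is a stricter variant of the one above that matches the three quoting 
styles directly, and the relative/absolute test is the same one used in the 
script (a leading "/" or a protocol spec means the link is left alone). The 
anonymous function needs PHP 5.3 or later.

```php
<?php

# hypothetical helper; the name and signature are not from the thread
function prefixRelativeHrefs($html, $curDir)
{
        # group 1: everything from "<a" through "href="
        # group 2: the value, double-quoted, single-quoted or bare
        $pattern = '#(<a\b[^>]*?\shref\s*=\s*)("[^"]*"|\'[^\']*\'|[^\s>"\']+)#i';

        return preg_replace_callback($pattern, function($m) use ($curDir)
        {
                $raw = $m[2];
                $quote = ($raw[0] == '"' || $raw[0] == '\'') ? $raw[0] : '';
                $url = ($quote == '') ? $raw : (string) substr($raw, 1, -1);

                # empty values, absolute paths and URLs with a protocol
                # spec are returned untouched
                if($url == '' || $url[0] == '/' || preg_match('#^\w+://#', $url))
                        return $m[0];

                # relative path: prepend the directory, keep the quoting
                return $m[1] . $quote . $curDir . '/' . $url . $quote;
        }, $html);
}

# made-up sample input covering all three quoting styles
$html = '<a class="s" href="bar.php" >a</a> <a href=\'/abs.php\'>b</a> '
      . '<a href=baz.php id=z>c</a> <a href="http://example.com/">d</a>';

echo prefixRelativeHrefs($html, '/foo');
```

Only the first and third links change here; the other two are treated as 
absolute. A regex like this will still trip over unusual markup, so for 
anything beyond simple cases a real HTML parser such as DOMDocument is 
probably the safer route.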

HTH,


// Todd
