At 15:38 21.11.2002, David Russell spoke out and said:
--------------------[snip]--------------------
>One of the steps I am looking at doing is to replace something "<a
>href="blah" onmouseover="blah"&gt;" with "<a href="blah">"
--------------------[snip]-------------------- 

I found it way easier not to look for encoded values but for the characters
themselves, as it is a lot easier with regexes to scan for characters (or,
better, to scan for everything EXCEPT a certain character).

So I once took this approach:

Step 1 - extract all "allowed" tags
Step 2 - htmlentitize the string
Step 3 - put the pieces together again

You need to consider that there are several ways to write a link tag (and
other tags, too):
    <a href="foo" title="bar">
    < a title = "bar" href = "foo" any="other">
etc etc.

So you need to look for the "href" portion, enclosed by (encoded) angle
brackets:
    $re = '/(.*?)(<\s*a\s+[^>]*?href.*?>)(.*)/is';
This reads as
    (       build a group (the pre-match)
    .*?     with anything until the very next '<' (below)
    )       end group
    (       build a group (the link tag)
    <       beginning with '<'
    \s*a\s+ followed by optional blanks and an 'a' followed by at least one
            blank
    [^>]*?  followed by anything EXCEPT '>' until the very next
    href    "href"
    .*?     followed by anything until the very next
    >       '>'
    )       end group
    (.*)    a last group with everything that is left (the post-match)
The 'i' modifier makes the expression case insensitive; the 's' modifier
lets '.' match newlines as well, so a buffer with several lines is scanned
as a whole.
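
To check the expression against the different spellings of a link tag,
something like the following should do. It is only a sketch:
find_link_tag() is a name made up for this example, the expression is the
one spelled out in the breakdown above, and the 's' modifier is included on
the assumption that the buffer may span several lines.

```php
<?php
// Hypothetical helper: returns the first link tag found in $s, or null.
function find_link_tag($s)
{
    // 'i' = case insensitive, 's' = '.' also matches newlines (assumption:
    // the input may contain line breaks).
    $re = '/(.*?)(<\s*a\s+[^>]*?href.*?>)(.*)/is';
    return preg_match($re, $s, $m) ? $m[2] : null;
}

$samples = array(
    'see <a href="foo" title="bar">here</a>',
    'see < a title = "bar" href = "foo" any="other">here</a>',
    "see <A\nHREF=\"foo\">here</a>",
    'see <b>nothing</b> here',
);

foreach ($samples as $s) {
    $tag = find_link_tag($s);
    echo ($tag === null ? 'no link tag' : 'link tag: ' . $tag), "\n";
}
```

The last sample contains no link tag, so find_link_tag() returns null for it.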

Next we parse the whole buffer for the href:

    $result = '';
    while ($buffer && preg_match($re, $buffer, $aresult)) {
        // $aresult is:
        // [0] - everything that matched
        // [1] - pre-match
        // [2] - matched group (the link tag)
        // [3] - post-match
        $result .= htmlentities($aresult[1]) . $aresult[2];
        $buffer = $aresult[3];
    }
    // what is left contains no more link tags, so encode it as well
    $result .= htmlentities($buffer);

This loops through the data buffer, applying htmlentities() to all parts
except any link tag.
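
Wrapped into a function, the loop might look like the sketch below. The
function name entitize_except_links() is invented for this example, and the
'i' and 's' modifiers on the expression are assumptions (case-insensitive
matching and multi-line input).

```php
<?php
// Hypothetical wrapper around the loop: entity-encodes everything in
// $buffer except opening <a ... href ...> tags.
function entitize_except_links($buffer)
{
    $re = '/(.*?)(<\s*a\s+[^>]*?href.*?>)(.*)/is';
    $result = '';
    while ($buffer !== '' && preg_match($re, $buffer, $m)) {
        // $m[1] = text before the tag, $m[2] = the tag, $m[3] = the rest
        $result .= htmlentities($m[1]) . $m[2];
        $buffer = $m[3];
    }
    // No further link tags: encode the remainder, too.
    return $result . htmlentities($buffer);
}

echo entitize_except_links('<b>x</b> <a href="foo">y</a> <i>z</i>'), "\n";
// prints: &lt;b&gt;x&lt;/b&gt; <a href="foo">y&lt;/a&gt; &lt;i&gt;z&lt;/i&gt;
```

Note that the closing </a> gets encoded as well, which is exactly the
limitation of looking for the opening tag only.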

Of course this example only works for the <a href> tag. If you have
multiple tags (and you _do_ have them, since you also need to check for the
</a> tag), find ANY tag and check whether it is valid:
    $re = '/(.*?)(<\s*)(\/?)([^>]*?)(\s*>)(.*)/s';
The 's' modifier lets '.' match across line breaks. preg_match will create
the following result array:
    [0] - whole buffer
    [1] - prematch
    [2] - tag opener incl. opt. blanks
    [3] - optional '/' for the closing tag
    [4] - tag contents
    [5] - tag closer incl. opt. blanks
    [6] - postmatch
You can then, within your loop, analyze the tag contents (entry [4]) and
decide how to proceed.
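
As a rough sketch of such a loop, assuming you only want to decide by tag
name: the function entitize_except_tags() and its $allowed parameter are
made up for this example, and taking the first word of entry [4] as the tag
name is just one possible way to analyze it.

```php
<?php
// Hypothetical helper: keeps tags whose name is listed in $allowed and
// entity-encodes everything else, including all other tags.
function entitize_except_tags($buffer, $allowed)
{
    // The any-tag expression, with 's' so that '.' also matches newlines.
    $re = '/(.*?)(<\s*)(\/?)([^>]*?)(\s*>)(.*)/s';
    $result = '';
    while ($buffer !== '' && preg_match($re, $buffer, $m)) {
        // Take the first word of the tag contents ([4]) as the tag name.
        $name = preg_match('/^\s*([a-z][a-z0-9]*)/i', $m[4], $n)
            ? strtolower($n[1])
            : '';
        $tag = $m[2] . $m[3] . $m[4] . $m[5];
        if (in_array($name, $allowed, true)) {
            $result .= htmlentities($m[1]) . $tag;   // keep the tag as is
        } else {
            $result .= htmlentities($m[1] . $tag);   // encode the tag, too
        }
        $buffer = $m[6];
    }
    return $result . htmlentities($buffer);
}

echo entitize_except_tags(
    '<b>bold</b> <script>x</script> <a href="foo">y</a>',
    array('a', 'b')
), "\n";
// prints: <b>bold</b> &lt;script&gt;x&lt;/script&gt; <a href="foo">y</a>
```

This keeps an allowed tag together with all of its attributes, so an
onmouseover would survive; to get rid of those you would have to rebuild the
tag from entry [4] instead of passing it through unchanged.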


-- 
   >O Ernest E. Vogelsinger 
   (\) ICQ #13394035 
    ^ http://www.vogelsinger.at/
