On Fri, Oct 01, 2010 at 06:23:06PM +0200, PyroPeter wrote:
> About splitting at boundaries: Contrary to what I have said before,
> using regular expressions seems to be a valid and efficient way.
> (I thought you would have to escape tag-content and attributes in
> different ways (percent-encoding vs. html-entities). After reading
> the HTML4 specification I realized this is not the case, as content and
> attributes are both escaped using html-entities)

Regular expressions are an efficient way to do this, but they should be applied before htmlspecialchars() or anything similar. E.g. we could use preg_match() or preg_match_all() with PREG_OFFSET_CAPTURE to get the positions of all links, then call a function that converts the links and escapes their strings as necessary, and convert the parts that don't contain any links separately using htmlspecialchars().
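
A rough sketch of what I mean, not a finished patch: the function name linkify_and_escape() and the URL pattern are made up for illustration (the pattern is deliberately simplistic), so treat both as assumptions. It finds the links in the raw text first, and only then escapes each piece separately.

```php
<?php
// Hypothetical helper: escape plain text and turn bare URLs into links.
function linkify_and_escape($text)
{
    // Simplistic URL pattern, an assumption for illustration only.
    $pattern = '~https?://[^\s<>"]+~';
    // With PREG_OFFSET_CAPTURE every match is array(string, byte offset).
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    $out = '';
    $pos = 0;
    foreach ($matches[0] as $m) {
        $url = $m[0];
        $offset = $m[1];
        // Escape the link-free part in front of this link.
        $out .= htmlspecialchars(substr($text, $pos, $offset - $pos), ENT_QUOTES);
        // Attribute value and element content use the same entity escaping.
        $safe = htmlspecialchars($url, ENT_QUOTES);
        $out .= '<a href="' . $safe . '">' . $safe . '</a>';
        $pos = $offset + strlen($url);
    }
    // Escape whatever follows the last link.
    return $out . htmlspecialchars(substr($text, $pos), ENT_QUOTES);
}
```

Since the matching runs on the unescaped text, an "&" inside a URL is still a plain "&" when the pattern sees it, and it gets escaped exactly once afterwards. The offsets are byte offsets, so substr() and strlen() are the right counterparts here, not the mb_* functions.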
