A. Pagaltzis <[EMAIL PROTECTED]> wrote:

> More than those you mention - because it doesn't parse HTML,
> just looks for some string bits. It will blow up on
>
>     <img alt="<a r g h>" ...>
True, but in the real world (or at least that part of it I
experience), you're more likely to run into something like

    <img src=http://www.example.com/images/abcd.gif>

which will be handled by the regex but may cause a parser to blow
up (though some are more tolerant than others). It's sad that such
code exists, it's sad that browsers tolerate it without complaint,
but we have to deal with it.

Unfortunately, stuff on pages encountered in the wild often isn't
valid HTML -- in fact, that was the whole point of the exercise
here. Valid HTML would have had the closing tags already. And the
stuff being produced isn't valid HTML either, since the tags may be
misnested.

Sometimes parsing is overkill. If regexes are good enough for Tim
Bray, they're good enough for me:

| That leaves input data munging, which I do a lot of, and a
| lot of input data these days is XML. Now here's the dirty
| secret; most of it is machine-generated XML, and in most
| cases, I use the perl regexp engine to read and process
| it. I've even gone to the length of writing a prefilter to
| glue together tags that got split across multiple lines,
| just so I could do the regexp trick.

http://www.tbray.org/ongoing/When/200x/2003/03/16/XML-Prog

-- 
Keith C. Ivey <[EMAIL PROTECTED]>
Washington, DC
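For what it's worth, here's a rough sketch of the kind of regex being
discussed -- in Python rather than Perl, and just my own illustration,
not code from the thread. It tolerates an unquoted src value, which a
strict parser may reject:

```python
import re

# Illustrative only: accept src=VALUE where VALUE is double-quoted,
# single-quoted, or a bare run of non-space, non-'>' characters
# (the invalid-but-common unquoted form).
SRC_RE = re.compile(
    r'<img\b[^>]*\bsrc\s*=\s*("([^"]*)"|\'([^\']*)\'|([^\s>]+))',
    re.IGNORECASE,
)

def img_src(tag):
    m = SRC_RE.search(tag)
    if m is None:
        return None
    # Whichever quoting style matched, one of these groups holds the value.
    return m.group(2) or m.group(3) or m.group(4)

print(img_src('<img src=http://www.example.com/images/abcd.gif>'))
# -> http://www.example.com/images/abcd.gif
print(img_src('<img alt="photo" src="abcd.gif">'))
# -> abcd.gif
```

Note that this still stumbles on Pagaltzis's example: the `[^>]*`
stops at the `>` inside alt="<a r g h>", so neither approach is
bulletproof -- it's a question of which breakage you're more likely
to meet in practice.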
