Peter Boughton wrote on Wed 18/11/2009 at 03:12:
> The only time parsing HTML with RegEx might be remotely viable is when you know
> what that code will be - if the HTML is uncontrolled then using RegEx is a futile effort.
>
> RegEx is for dealing with Regular text, and HTML is not a Regular language - even
> modern regex engines that implement non-Regular features *cannot* deal with the
> potential complexity of HTML.
>
> The correct solution is to **use a tool designed for parsing HTML**.

Ok Peter, thanks for the heads-up.

> There isn't one native to CF, but there are a number of Java ones available - take a
> look at:
> http://java-source.net/open-source/html-parsers
>
> I haven't used any of those, I'd probably start with TagSoup or NekoHTML since they
> look promising, but any HTML parser that produces a DOM structure which you can
> run XPath expressions against will allow you to extract the specific information you
> want.

TagSoup it is.

Mark

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328478
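
For illustration, here is a minimal sketch of the "parse to a DOM, then run XPath against it" approach Peter describes. It is not TagSoup code: it uses only the JDK's built-in XML parser, so it assumes the input is already well-formed XHTML. For uncontrolled real-world HTML, the parsing step would be replaced by an HTML parser such as TagSoup or NekoHTML, while the XPath step stays the same in spirit. The class name `LinkExtractor`, the method `extractLinks`, and the sample markup are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: extracts link targets from well-formed XHTML.
public class LinkExtractor {

    // Returns the href value of every <a> element, in document order.
    // Assumption: the input is well-formed; the JDK XML parser rejects tag soup.
    public static List<String> extractLinks(String xhtml) {
        try {
            // Step 1: parse the markup into a DOM tree.
            // With a real HTML parser, only this step would change.
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Step 2: select the wanted nodes with an XPath expression.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs =
                (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

            // Step 3: copy the attribute values into a plain list.
            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            // Malformed input or parser setup problems end up here.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
            + "<p><a href=\"http://example.com/a\">A</a></p>"
            + "<a href=\"http://example.com/b\">B</a>"
            + "</body></html>";
        System.out.println(extractLinks(page));
    }
}
```

The XPath expression is the only part that encodes what to extract, so changing the target (for example to `//img/@src`) does not require touching the parsing code.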

