RE: Regex help with invalid HTML

Mark Henderson Tue, 17 Nov 2009 12:11:33 -0800

Peter Boughton wrote on Wed 18/11/2009 at 03:12:

> The only time parsing HTML with RegEx might be remotely viable is when
> you know what that code will be - if the HTML is uncontrolled then
> using RegEx is a futile effort.
>
> RegEx is for dealing with Regular text, and HTML is not a Regular
> language - even modern regex engines that implement non-Regular
> features *cannot* deal with the potential complexity of HTML.
>
> The correct solution is to **use a tool designed for parsing HTML**.
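
For anyone reading this in the archive, here is a small sketch of the kind of failure Peter is describing, using nested tags as the example. The class name NestedTagRegexDemo and the method firstDivContents are made up for illustration; only java.util.regex from the JDK is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTagRegexDemo {
    // Naive pattern: lazily capture everything between a <div> and the nearest </div>.
    private static final Pattern DIV =
            Pattern.compile("<div>(.*?)</div>", Pattern.DOTALL);

    /** Returns the captured contents of the first match, or null if nothing matches. */
    public static String firstDivContents(String html) {
        Matcher m = DIV.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The outer div contains an inner div, so the nearest </div> is the wrong one.
        String html = "<div>outer <div>inner</div> tail</div>";
        System.out.println(firstDivContents(html));
    }
}
```

The match stops at the inner closing tag, so the captured text is cut off inside the outer div and " tail" is lost. Making the quantifier greedy only moves the problem: it then overshoots when two sibling divs sit next to each other.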


Ok Peter, thanks for the heads-up.


> There isn't one native to CF, but there are a number of Java ones
> available - take a look at:
> http://java-source.net/open-source/html-parsers
>
> I haven't used any of those, I'd probably start with TagSoup or
> NekoHTML since they look promising, but any HTML parser that produces
> a DOM structure which you can run XPath expressions against will allow
> you to extract the specific information you want.

TagSoup it is.
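
A rough sketch of the "DOM plus XPath" half of that suggestion, for the archive. TagSoup is not part of the JDK, so this does not call it: parseWellFormed below is only a stand-in that feeds already well-formed markup to the JDK's own XML parser, and would be replaced by whatever HTML parser ends up producing the DOM. The class and method names (XPathExtractDemo, parseWellFormed, extract) are hypothetical; the javax.xml.xpath calls are standard JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExtractDemo {
    /** Stand-in for the HTML parser step: only accepts well-formed markup. */
    public static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("markup could not be parsed", e);
        }
    }

    /** Runs an XPath expression against the DOM and returns each match as text. */
    public static List<String> extract(Document dom, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xpath.evaluate(expression, dom, XPathConstants.NODESET);
            List<String> results = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                results.add(nodes.item(i).getTextContent());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<html><body><a href=\"/one\">One</a>"
                + "<p><a href=\"/two\">Two</a></p></body></html>";
        // Collect every link target, however deeply the anchor is nested.
        System.out.println(extract(parseWellFormed(markup), "//a/@href"));
    }
}
```

The point is that the extraction logic lives in the XPath expression and does not care how the tags are nested, which is exactly what the regex approach could not manage.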

Mark

Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:328478