On 2/7/01, Randy Pringle penned:
>We need to parse an HTML page, and remove all HTML tags. Could someone
>please explain how to do this in ColdFusion? There is a component in ASP
>that allows this sort of thing, but we prefer to do it in ColdFusion.
You could probably do this with a custom tag. Roughly:

1. Find the position of <body, then the next > after it, and delete
   everything up to and including that >.
2. Find </body and delete everything from there to the end (len())
   of the template.
3. Replace all line breaks and tabs with a single space. Replacing
   them with nothing would run together words that were split
   across lines.
4. Replace every <br> with a single line break and every <p> with a
   double line break.
5. Loop through systematically, finding each remaining < and the
   following >, and replace that string with nothing.
6. Collapse runs of spaces: keep replacing (space space) with
   (space) until no double spaces are left, which should leave the
   template with all single spaces.
7. Look for each & and the following ;. If there are no spaces
   between the two, it should be a character code, so delete the
   string.
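
For what it's worth, here is a minimal sketch of the steps above. It is
written in Python rather than ColdFusion, purely to pin down the order
of operations. strip_html is a made-up name, and the regular
expressions are my own guesses at reasonable patterns, not something
tried against real-world pages. It swaps line breaks and tabs for a
space instead of nothing, so words split across lines don't run
together. The same patterns could probably be handed to
REReplaceNoCase() in ColdFusion with some adjustment, but that is
untested here.

```python
import re


def strip_html(page: str) -> str:
    """Hypothetical helper: reduce an HTML page to rough plain text."""
    text = page

    # Keep only what sits between <body ...> and </body>, when they exist.
    start = re.search(r"<body\b[^>]*>", text, re.IGNORECASE)
    if start:
        text = text[start.end():]
    end = re.search(r"</body\b", text, re.IGNORECASE)
    if end:
        text = text[:end.start()]

    # Line breaks and tabs in the source become a single space
    # (assumption: a space rather than nothing, to keep words apart).
    text = re.sub(r"[\r\n\t]+", " ", text)

    # <br> becomes one line break, <p> becomes two.
    # \b keeps <p from also matching <pre> or <param>.
    text = re.sub(r"<br\b[^>]*>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<p\b[^>]*>", "\n\n", text, flags=re.IGNORECASE)

    # Every remaining <...> run is treated as a tag and dropped.
    text = re.sub(r"<[^>]*>", "", text)

    # "&" followed by letters/digits and ";" with no spaces in between
    # is assumed to be a character code and is deleted.
    text = re.sub(r"&#?\w+;", "", text)

    # Collapse runs of spaces, then tidy spaces around line breaks.
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r" *\n *", "\n", text)

    return text.strip()
```

Deleting character codes outright follows the steps as described, but
it also throws away things like &amp; and &quot;, so decoding them
might be the better choice if the text has to stay readable.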
Once all that is done, you should have a reasonable facsimile of what
the page would have looked like. If you're trying to keep the
formatting, lord knows what you'd have to do with tables and forms
though.
--
Bud Schneehagen - Tropical Web Creations
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
ColdFusion Solutions / eCommerce Development
http://www.twcreations.com/
954.721.3452