On 2/7/01, Randy Pringle penned:
>We need to parse an HTML page, and remove all HTML tags. Could someone
>please explain how to do this in ColdFusion? There is a component in ASP
>that allows this sort of thing, but we prefer to do it in ColdFusion.

You could probably do some sort of custom tag. First, get the 
location of <body, then look for the next instance of > and replace 
everything up to that point with nothing. Then look for </body and 
replace everything from there to the end of the template (len()) with 
nothing. Then I'd replace all line breaks with nothing, and look for 
any occurrence of <br> and replace with a single line break and all 
instances of <p> and replace with a double line break. Then you'd 
have to systematically loop through and look for any other instance of 
< and the following occurrence of > and replace that string with 
nothing. You could also replace all tabs with nothing, maybe replace 
all instances of (space space) with (space | space), then delete all 
instances of (| space) which should leave the template with all 
single spaces. You could also look for instances of & then a 
following instance of ;, and check whether there are any spaces 
between the two. If there are none, it should be a character code 
(such as &nbsp;), and you can delete the string.
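
The steps above can be sketched roughly as follows. This is only an 
illustration in Python, not ColdFusion, and not a tested custom tag. 
The function name strip_html is made up for the example. It also 
departs from the description in two places: source line breaks and 
tabs become a space rather than nothing (so that words split across 
lines do not run together), and runs of spaces are collapsed with a 
single regular expression instead of the pipe trick. The same 
patterns should carry over to ColdFusion's regular expression 
functions, but I have not checked that.

```python
import re


def strip_html(page: str) -> str:
    """Reduce an HTML page to rough plain text (hypothetical helper)."""
    # Keep only what lies between the <body ...> tag and </body>, if present.
    start = re.search(r"<body[^>]*>", page, flags=re.IGNORECASE)
    if start:
        page = page[start.end():]
    end = re.search(r"</body", page, flags=re.IGNORECASE)
    if end:
        page = page[:end.start()]

    # Assumption: turn source line breaks and tabs into a space (not nothing)
    # so that words on adjacent lines stay separated.
    page = re.sub(r"[\r\n\t]", " ", page)

    # <br> becomes a single line break, <p> a double line break.
    page = re.sub(r"<br\s*/?>", "\n", page, flags=re.IGNORECASE)
    page = re.sub(r"<p(\s[^>]*)?>", "\n\n", page, flags=re.IGNORECASE)

    # Remove every remaining tag: a "<" up to the following ">".
    page = re.sub(r"<[^>]*>", "", page)

    # Delete character codes: "&" up to ";" with no whitespace in between.
    page = re.sub(r"&[^\s;&]+;", "", page)

    # Collapse runs of spaces into single spaces.
    page = re.sub(r" {2,}", " ", page)
    return page.strip()


if __name__ == "__main__":
    sample = "<html><body><p>Hello, <b>world</b>!<br>Second   line</p></body></html>"
    print(strip_html(sample))
```

Tables and forms are not handled at all here; their tags are simply 
removed along with everything else.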

Once all that is done, you should have a reasonable facsimile of what 
the page would have looked like. If you're trying to keep the 
formatting, lord knows what you'd have to do with tables and forms 
though.
-- 

Bud Schneehagen - Tropical Web Creations

_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
ColdFusion Solutions / eCommerce Development
http://www.twcreations.com/
954.721.3452
