Michael,
This is a function I wrote to convert HTML to text for a newsletter system.
It works well for the most part and tries to retain mailto: and http: style
links etc.
The function call to StripHTML at the end is pretty much the same as your
REReplace for the remaining HTML tags.
Let me know if you use it and make any improvements to it.
Paul
<cffunction name="HTMLtoText" access="public" returntype="string"
output="false"
hint="Formats the text newsletter by parsing any HTML and replacing it with relevant formatting.">
<cfargument name="theHTML" type="string" required="true">
<cfset var theText = arguments.theHTML>
<cfset var eMailLinkStart = 0>
<cfset var eMailLinkEnd = 0>
<cfset var eMailLink = "">
<cfset var eMailLinkInfo = "">
<cfset var emailAddress = "">
<!--- parse the input string for special HTML cases and
replace them with the relevant text equivalent --->
<cfset theText = reReplaceNoCase(theText, "<p[^>]*>", "",
"ALL")>
<cfset theText = reReplaceNoCase(theText,
"<br[^>]*>|<div[^>]*>|</ul>", chr(13)&chr(10), "ALL")>
<cfset theText = replaceNoCase(theText, "</p>",
chr(13)&chr(10)&chr(13)&chr(10), "ALL")>
<cfset theText = reReplaceNoCase(theText,
"&rsquo;|&lsquo;|&##39;", "'", "ALL")>
<cfset theText = reReplaceNoCase(theText, "<li[^>]*>",
chr(13)&chr(10)&chr(9)&"* ", "ALL")>
<cfset theText = reReplaceNoCase(theText, "<hr[^>]*>",
chr(13)&chr(10)&
"----------------------------------------------------------------------------"
&chr(13)&chr(10), "ALL")>
<cfset theText = replaceNoCase(theText, "&nbsp;", " ",
"ALL")>
<cfset theText = replaceNoCase(theText, "&quot;", """",
"ALL")>
<!--- prefix bare "www." links with "http://"; links that already
carry a scheme (http:// or https://) are left unchanged --->
<cfset theText = reReplaceNoCase(theText, "(^|[^/])www\.",
"\1http://www.", "ALL")>
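<!--- replace each <a href="mailto:..."> link with its bare address --->
<cfset eMailLinkStart =
reFindNoCase("<a[^>]*href=""mailto:[^>]*>", theText)>
<cfloop condition="eMailLinkStart GT 0">
<cfset eMailLinkEnd = FindNoCase("</a>", theText,
eMailLinkStart)>
<!--- stop on an unclosed link, which would otherwise give Mid()
a negative count --->
<cfif eMailLinkEnd EQ 0>
<cfbreak>
</cfif>
<!--- the complete link, from the opening <a> through </a> --->
<cfset eMailLink = Mid(theText, eMailLinkStart,
(eMailLinkEnd-eMailLinkStart)+4)>
<!--- pull "mailto:address" out of the href attribute --->
<cfset eMailAddress = eMailLink>
<cfset eMailLinkStart = reFindNoCase("mailto:",
eMailAddress)>
<cfset eMailLinkEnd = reFindNoCase("""",
eMailAddress, eMailLinkStart)>
<cfset eMailAddress = Mid(eMailAddress,
eMailLinkStart, (eMailLinkEnd-eMailLinkStart))>
<!--- swap the link for the address, minus the mailto: prefix --->
<cfset theText = ReplaceNoCase(theText, eMailLink ,
replaceNoCase(eMailAddress, "mailto:", ""))>
<cfset eMailLinkStart =
reFindNoCase("<a[^>]*href=""mailto:[^>]*>", theText)>
</cfloop>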
<cfset theText = reReplaceNoCase(theText,
"[^:[EMAIL PROTECTED]", " mailto:\0", "all")>
<cfset theText = reReplaceNoCase(theText, "mailto:[\s]",
"mailto:", "All")>
<cfset theText = ReplaceNoCase(theText, chr(13)&" mailto:",
chr(13)&chr(10)&"mailto:", "All")>
<!--- strip out any remaining HTML tags --->
<cfset theText = variables.system.stripHTML(theText)>
<cfreturn replaceNoCase(theText, "[!--unsubscribelink--]",
"<!--unsubscribelink-->")>
</cffunction>
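
The function above relies on variables.system.stripHTML, which is not included in this message. For anyone without an equivalent helper, a stand-in could be sketched along these lines. This is only an assumption modelled on the REReplace in the quoted message below, not the actual code behind variables.system, and the function body is hypothetical:

```cfml
<!--- Hypothetical stand-in for variables.system.stripHTML (an assumption,
      not the original helper). It removes anything that looks like a tag,
      using the same pattern as the REReplace in the quoted message. --->
<cffunction name="stripHTML" access="public" returntype="string"
            output="false"
            hint="Removes any remaining HTML tags from a string.">
    <cfargument name="theHTML" type="string" required="true">
    <cfreturn reReplace(arguments.theHTML, "<[^>]+>", "", "ALL")>
</cffunction>
```

A pattern this simple also removes HTML comments, which would explain why the unsubscribe placeholder is carried through as [!--unsubscribelink--] and only turned back into a comment after the tags are stripped.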
> -----Original Message-----
> From: Michael Dinowitz [mailto:[EMAIL PROTECTED]
> Sent: 13 August 2007 16:53
> To: RegEx
> Subject: html email parsing
>
> Currently, these lists block mail that does not have a text portion.
> This means that HTML only email does not get through. I've decided to
> try out some code to take those emails that are HTML only and strip
> out the HTML while trying to retain some of the line formatting. I
> came up with this:
> <!--- Replace all breaks with a single carriage return --->
> <cfset string=replacenocase(string, '<br>', chr(10), 'all')>
> <!--- Remove all HTML tags --->
> <cfset string=rereplace(string, '<[^>]+>', '', 'all')>
> <!--- If there are 3 or more newlines in a row, turn them into 2
> newlines. Some newlines have spaces after them. Finally, trim the
> string --->
> <cfset string=trim(rereplace(string, '(?:(\n|\r){2}\s*){3,}', '\1\1',
> 'all'))>
>
> Anyone see a problem here or a way to do it better?
>
> Thanks
>
> --
> Michael Dinowitz
> President: House of Fusion (http://www.houseoffusion.com)
> Publisher: Fusion Authority (http://www.fusionauthority.com)
> Adobe Community Expert / Advanced Certified ColdFusion Professional
>
>
Archive: http://www.houseoffusion.com/groups/RegEx/message.cfm/messageid:1051