Re: Cleaning stored text to get valid XML
> This will clean only about 1% of the crap, maybe not even...

The data in question was entered using fckeditor, which has taken care of a good bit of the "problem" stuff for me. It was the few things left over that fck didn't deal with that were giving me fits.

I see some great potential with cleanWord! Thanks for sharing! I'll certainly add it to my toolbox, and may pull some bits and pieces as needed for the current project as well.

> Here is a function that will clean up more, and I'm still improving it ;-)
>
> function cleanWord (html)
> // cleans pasted text from Word
> [snip]
Re: Cleaning stored text to get valid XML
>> Finally found a function that seems to clean most of the MS Word control
>> characters and other crap out that was causing me problems. Using two
>> filters on the body text seems to be taking care of my problems now.

This will clean only about 1% of the crap, maybe not even...
Here is a function that will clean up more, and I'm still improving it ;-)

function cleanWord (html)
// cleans pasted text from Word
{
    //alert(html)
    html = html.replace(/<o:p>\s*<\/o:p>/g, "") ;
    html = html.replace(/<o:p>.*?<\/o:p>/g, "") ;

    // Remove mso-xxx styles.
    html = html.replace( /\s*mso-[^:]+:[^;"]+;?/gi, "" ) ;

    // Remove margin styles.
    html = html.replace( /\s*MARGIN: 0cm 0cm 0pt\s*;/gi, "" ) ;
    html = html.replace( /\s*MARGIN: 0cm 0cm 0pt\s*"/gi, "\"" ) ;

    html = html.replace( /\s*TEXT-INDENT: 0cm\s*;/gi, "" ) ;
    html = html.replace( /\s*TEXT-INDENT: 0cm\s*"/gi, "\"" ) ;

    html = html.replace( /\s*TEXT-ALIGN: [^\s;]+;?"/gi, "\"" ) ;

    html = html.replace( /\s*PAGE-BREAK-BEFORE: [^\s;]+;?"/gi, "\"" ) ;

    html = html.replace( /\s*FONT-VARIANT: [^\s;]+;?"/gi, "\"" ) ;

    html = html.replace( /\s*tab-stops:[^;"]*;?/gi, "" ) ;
    html = html.replace( /\s*tab-stops:[^"]*/gi, "" ) ;

    html = html.replace( /\s*FONT-FAMILY:[^;"]*;?/gi, "" ) ;

    // Remove Class attributes
    html = html.replace(/<(\w[^>]*)\s*class=([^ |>]*)([^>]*)/gi, "<$1$3") ;

    // Remove styles.
    html = html.replace( /<(\w[^>]*)style="([^\"]*)"([^>]*)/gi, "<$1$3" ) ;

    // Remove empty styles.
    html = html.replace( /\s*style="\s*"/gi, '' ) ;

    html = html.replace( /<SPAN[^>]*>\s* \s*<\/SPAN>/gi, ' ' ) ;

    html = html.replace( /<SPAN[^>]*>\s*<\/SPAN>/gi, '' ) ;

    // Remove Lang attributes
    html = html.replace(/<(\w[^>]*) lang=([^ |>]*)([^>]*)/gi, "<$1$3") ;

    html = html.replace( /<SPAN>([\s\S]*?)<\/SPAN>/gi, '$1' ) ;
    html = html.replace( /<SPAN>([\s\S]*?)<\/SPAN>/gi, '$1' ) ;
    html = html.replace( /<SPAN>([\s\S]*?)<\/SPAN>/gi, '$1' ) ;

    // remove all font tags
    html = html.replace( /<\/?FONT[^>]*>/gi, '' ) ;
    html = html.replace( /<\/?FONT[^>]*>/gi, '' ) ;
    html = html.replace( /<\/?FONT[^>]*>/gi, '' ) ;
    html = html.replace( /<\/?DIV([^>]*)>/gi, '' ) ;

    // Remove XML elements and declarations
    html = html.replace(/<\\?\?xml[^>]*>/gi, "") ;

    // Remove Tags with XML namespace declarations:
    html = html.replace(/<\/?\w+:[^>]*>/gi, "") ;

    html = html.replace( /<H\d>\s*<\/H\d>/gi, '' ) ;

    //clean up H tags
    html = html.replace( /]*)>/gi, '' ) ;
    html = html.replace( /]*)>/gi, '' ) ;
    html = html.replace( /]*)>/gi, '' ) ;
    html = html.replace( /]*)>/gi, '' ) ;
    html = html.replace( /]*)>/gi, '' ) ;
    html = html.replace( /]*)>/gi, '' ) ;
    html = html.replace( /]*)>/gi, '' ) ;
    html = html.replace( /]*)>/gi, '' ) ;
    html = html.replace( /\s*()+<\/P>/gi, '' ) ;
    html = html.replace( /<\/P>\s*(<\/P>)+<\/P>/gi, '' ) ;

    html = html.replace( /<(U|I|STRIKE)> <\/\1>/g, ' ' ) ;

    // no comment...
    html = html.replace( //gi, '' ) ;

    // transform bullet lists
    var re = new RegExp("·( | )*([\\s\\S]*?)", "gi");
    html = html.replace( re, "$2" ) ;
    re = new RegExp("·( | )*([\\s\\S]*?)", "gi");
    html = html.replace( /(|)[§·-]( | )*([\s\S]*?)<\/P>/gi, "$2" ) ;
    // remove spaces at beginning
    html = html.replace( /^( | )*\s*/, '') ;
    // replace all stupid ... because they are overridden by higher
    // style declarations like justify, etc.
    html = html.replace( /([\s\S]*?)<\/P>/gi, '$1' ) ;
    // remove useless
    html = html.replace( /<\/CENTER>(\s*\s*)/gi, '$1' ) ;
    // remove useless in
    html = html.replace( /(]*>)\s*\s*/gi, '$1' ) ;
    // replace ... inside of TDs
    html = html.replace( /(]*)>\s*([\s\S]*?)<\/CENTER>\s*<\/TD>/gi, '$1 align=center>$2' ) ;
    // remove Paragraphs inside TD
    html = html.replace(/(]*>)\s*]*>([\s\S]*?)\s*<\/P>\s*([\s\S]*?<\/TD>)/gi, '$1$2$3');

    // Remove empty tags (three times, just to make sure).
    html = html.replace( /<([^\s>]+)[^>]*>\s*<\/\1>/g, '' ) ;
    html = html.replace( /<([^\s>]+)[^>]*>\s*<\/\1>/g, '' ) ;
    html = html.replace( /<([^\s>]+)[^>]*>\s*<\/\1>/g, '' ) ;
    html = html.replace( /[^\n\r]/gi, '' ) ;
    html = html.replace( /[^\n\r]/gi, '' ) ;

    //alert(html)
    return (html);
}

--
___
REUSE CODE! Use custom tags;
See http://www.contentbox.com/claude/customtags/tagstore.cfm
(Please send any spam to this address: [EMAIL PROTECTED])
Thanks.

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:277080
Subscription: http://www.houseoffusion.com/groups/CF-Talk/subscribe.cfm
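For anyone who wants to try the idea on a small scale first, here is a minimal JavaScript sketch that chains six of the replacements from the post. It uses only patterns that came through the archive intact; several others in the post appear to have lost their opening tags along the way. The name `cleanWordSample` and the sample input are hypothetical and not part of the original function.

```javascript
// Hypothetical helper: a small subset of the cleanWord() steps, in the same order.
function cleanWordSample(html) {
  // Remove mso-xxx style declarations.
  html = html.replace(/\s*mso-[^:]+:[^;"]+;?/gi, "");
  // Remove class attributes.
  html = html.replace(/<(\w[^>]*)\s*class=([^ |>]*)([^>]*)/gi, "<$1$3");
  // Remove style attributes that are now empty.
  html = html.replace(/\s*style="\s*"/gi, "");
  // Remove all font tags.
  html = html.replace(/<\/?FONT[^>]*>/gi, "");
  // Remove tags with an XML namespace prefix, such as <o:p>.
  html = html.replace(/<\/?\w+:[^>]*>/gi, "");
  // Remove empty tags.
  html = html.replace(/<([^\s>]+)[^>]*>\s*<\/\1>/g, "");
  return html;
}

// Made-up fragment in the style of Word-pasted markup.
const sample =
  '<P class=MsoNormal style="mso-margin-top-alt: auto">' +
  "<FONT face=Arial>Hello<o:p></o:p></FONT></P><SPAN></SPAN>";

console.log(cleanWordSample(sample)); // prints <P>Hello</P>
```

The order matters: the class and mso-style steps leave an empty `style=""` behind, which the third step then removes.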
Re: Cleaning stored text to get valid XML
Re: http://www.nelsonmullins.com/rss/rss_newsletters.cfm

Finally found a function that seems to clean most of the MS Word control characters and other crap out that was causing me problems. Using two filters on the body text seems to be taking care of my problems now.

function ReplaceMicrosoftChars(arg_str) {
    return ReplaceList(arg_str,
        "#Chr(19)#,#Chr(20)#,#Chr(25)#,#chr(8216)#,#chr(8217)#,#Chr(8211)#,#Chr(8212)#,#Chr(145)#,#Chr(146)#,#Chr(147)#,#chr(8220)#,#chr(8221)#,#Chr(148)#,#Chr(29)#,#Chr(28)#,#Chr(150)#,#Chr(151)#,#Chr(8230)#",
        "--,--,',',',--,--,',',"","","","","","",-,-,...");
}

]*>","","all")#" >

Feed seems to be working now, until the client finds something else to throw in there that the above doesn't cover!

Of course, a better way to do this would be to create valid XML text right from the start, but I've got hundreds of records of legacy data to deal with.

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:277074
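The same character mapping can be sketched in JavaScript for comparison. This is only an illustration of the lookup-table idea, not a port of the ColdFusion code: `MS_CHAR_MAP` and `replaceMicrosoftChars` are hypothetical names, and the table simply mirrors the two lists passed to ReplaceList() above, pair by pair.

```javascript
// Code point -> replacement, mirroring the two ReplaceList() lists in the post.
const MS_CHAR_MAP = new Map([
  [19, "--"], [20, "--"], [25, "'"],
  [8216, "'"], [8217, "'"],
  [8211, "--"], [8212, "--"],
  [145, "'"], [146, "'"],
  [147, '"'], [8220, '"'], [8221, '"'], [148, '"'],
  [29, '"'], [28, '"'],
  [150, "-"], [151, "-"],
  [8230, "..."],
]);

// Walk the string one code point at a time and substitute mapped characters.
function replaceMicrosoftChars(str) {
  let out = "";
  for (const ch of str) {
    const replacement = MS_CHAR_MAP.get(ch.codePointAt(0));
    out += replacement === undefined ? ch : replacement;
  }
  return out;
}

// Curly quotes, an en dash, a curly apostrophe and an ellipsis.
console.log(replaceMicrosoftChars("\u201CHi\u201D \u2013 it\u2019s\u2026"));
// prints "Hi" -- it's...
```

A table like this only covers the characters listed in it, so anything new the client pastes in would still need its own entry.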
RE: Cleaning stored text to get valid XML
If the data is being pasted from Word, you will probably need to clean (Replace) the special Microsoft characters that Word generates for most of its invisible characters. I know I had problems posting data pasted from Word to my db, and it was those special characters causing the problem.

Check the Adobe site; they had an article on there somewhere with the hex codes for most of the problem characters. You may also need to replace quotes with something else, or remove them altogether.

-Original Message-
From: Les Mizzell [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 03, 2007 8:30 PM
To: CF-Talk
Subject: Cleaning stored text to get valid XML

I've had an application set up for a while now that allows a user to post and email newsletters from their site. The body text is entered on a form using fckeditor, and since these folks are lawyers, almost everything is pasted from Word and fckeditor is handling whatever is thrown at it.

Now, they wish to create an RSS feed from all their newsletters. Oh boy. There's all kinds of crap in the data - curly quotes, apostrophes, HTML tags, and gawd knows what else. I've been going nutz trying to clean the existing text enough to create valid XML text so it will display.

I start by getting rid of all the HTML junk, which is working fine:

]*>","","all")#" >

After that, it gets a little weird. I've tried all sorts of functions, xmlFormat2.cfm, ConvertSpecialChars ... chaining rereplacenocase to get rid of left and right quotes, apostrophes, whatever other junk I keep finding ... Nothing seems to be getting rid of everything, and the feed still isn't displaying correctly.

I know my code base is OK because I've created two other feeds that are working. There's *something* in the text that's still stopping a correct display.

How are you folks handling this sort of thing?

This one is working:
http://www.nelsonmullins.com/rss/rss_press.cfm

This one ain't - there's something in the body text somewhere I'm not stripping out...
http://www.nelsonmullins.com/rss/rss_newsletters.cfm

Suggestions?

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:277045
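The tag-stripping step in the question survives only as a fragment, so the following JavaScript sketch rests on an assumption: that the pattern was the common `<[^>]*>`, replaced with an empty string everywhere. `stripTags` is a hypothetical name, and JavaScript's `String.prototype.replace` stands in for the ColdFusion regex replace.

```javascript
// Hypothetical helper: drop anything that looks like an HTML tag.
// Assumes the pattern <[^>]*> (the original is only partly preserved).
function stripTags(html) {
  return html.replace(/<[^>]*>/g, "");
}

console.log(stripTags('<p class="x">Hello <b>world</b></p>'));
// prints Hello world
```

A pattern this simple removes the markup but leaves the text untouched, so curly quotes and other Word characters are still in the result and need a separate pass. It will also misfire on a literal `>` inside an attribute value.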
Re: Cleaning stored text to get valid XML
Robertson-Ravo, Neil (RX) wrote:
> Have you tried XMLFormat() around the content?

Yup, and the version up now is using it. Still doesn't want to display...
http://www.nelsonmullins.com/rss/rss_newsletters.cfm

Here's what I've got in there right now:

]*>","","all")#" >
]*>","","all")#" >

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:276996
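If escaping alone does not fix a feed like this, one thing worth ruling out is stray control characters, which XML 1.0 does not allow even as character references. That is a guess; the thread never confirms the cause. Here is a sketch of the check in JavaScript, with hypothetical function names, that filters the text down to the XML 1.0 `Char` ranges and then escapes the five predefined entities.

```javascript
// Keep only characters allowed by the XML 1.0 Char production:
// tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
function removeInvalidXmlChars(text) {
  return text.replace(
    /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu,
    ""
  );
}

// Escape the five characters that have predefined XML entities.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical convenience wrapper: filter first, then escape.
function toXmlText(text) {
  return escapeXml(removeInvalidXmlChars(text));
}

console.log(toXmlText("A\u0013B & <C>")); // prints AB &amp; &lt;C&gt;
```

Here the control character U+0013 is dropped rather than mapped to a dash, so a replacement table for Word punctuation would need to run before this step if those characters should be kept in some form.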
Re: Cleaning stored text to get valid XML
Have you tried XMLFormat() around the content?

-Original Message-
From: Les Mizzell
To: CF-Talk
Sent: Fri May 04 02:30:03 2007
Subject: Cleaning stored text to get valid XML

[snip]

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:276985