Re: Replacing special characters in a XML document

Jesse Houwing Sat, 21 Sep 2002 01:07:38 -0700

S. Isaac Dealey wrote:
>>Ok, I've got a little problem here. I'm reading an XML file from a third
>>party and displaying its content. The problem is that the third party
>>is not checking for illegal characters in the XML file. So things like:
> 
> 
>><news>He & me are</news>
> 
> 
>>will show up in the damn thing. So I want to replace the special
>>characters, but only those that are outside of the tags. I've probably
>>got to use a regexp for this, but I'm not sure how to do this. I know I
>>can select part of a string with a regexp and replace it with a changed
>>version of that string, but how is that done efficiently, and in one
>>REReplace (I know it can be done, but don't know how)?
> 
> 
>>Anyone?
> 
> 
>>Jesse
> 
> 
> Unfortunately, while you can use back-references to return a portion of a
> found regular expression back to the replacement, you can't use any kind of
> functions or conditional logic on these back-references, so you'd have to
> replace each character individually... As for actually getting the illegal
> characters, try something like this:
> 
> <cfset illegalchar = REFind(">[ _.[:alnum:]-]*([^< _.[:alnum:]-])", myxmlpacket, 1, true)>
> 
> With the fourth argument set to true, illegalchar.pos[2] should give you
> the location of the first illegal character in the packet, within the
> contents of an element, assuming that an illegal character is anything
> other than a space, underscore, hyphen, dot or alpha-numeric character...
> That's probably not a real good definition for illegal characters, but
> it's a starting point. :)
> 
> Once you know where that character is, then you can replace it with
> something like <char=#asc(illegalcharacter)#> or whatever the spec. is for
> special characters in your xml dtd. Am I using the terminology correctly?
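
A minimal sketch of the idea in the quoted reply, written in Python rather than CFML only so that it runs standalone: walk the text between tags and swap each disallowed character for a numeric character reference (&#N;), which is the form XML itself defines. The whitelist (whitespace, underscore, dot, hyphen, ASCII letters and digits) is an assumption that loosely follows the definition above, and escape_text_content is a made-up name for illustration.

```python
import re

# Assumption: "legal" means whitespace, underscore, dot, hyphen or an ASCII
# letter/digit, loosely following the definition in the quoted message.
ILLEGAL = re.compile(r"[^\s_.\-A-Za-z0-9]")

# Text content: a run of non-bracket characters that directly follows a ">".
TEXT_RUN = re.compile(r"(?<=>)[^<>]+")


def escape_text_content(xml_text):
    """Replace illegal characters in element content with &#N; references."""
    def fix_run(match):
        # Rewrite every illegal character inside this one text run.
        return ILLEGAL.sub(lambda m: "&#%d;" % ord(m.group(0)), match.group(0))

    return TEXT_RUN.sub(fix_run, xml_text)


print(escape_text_content("<news>He & me are</news>"))
# <news>He &#38; me are</news>
```

This is deliberately blunt: a reference that is already present gets mangled too (&amp; comes out as &#38;amp&#59;), and comments or CDATA sections are not treated specially, so it only suits feeds where nothing has been escaped yet.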


Ok, I found a solution. It works fine, but could use a bit of
optimization, I think. I first check whether the document is valid, and
only run the fix-up below if it is not, so the impact should not be too
high, as they usually DO give valid XML to parse.
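
The "check first, repair only on failure" flow could look roughly like the sketch below. It is in Python rather than CFML purely so it can be run standalone, and it uses a narrower repair than the one in this message: only ampersands that do not start a reference are escaped. The names parse_lenient and BARE_AMP are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that does not start a named or numeric reference such as &amp; or &#38;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def parse_lenient(xml_text):
    """Parse as-is when possible; escape bare ampersands only if that fails."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        repaired = BARE_AMP.sub("&amp;", xml_text)
        # This second parse still raises if the feed is broken in other ways.
        return ET.fromstring(repaired)


root = parse_lenient("<news>He & me are</news>")
print(root.text)  # He & me are
```

Valid feeds take the fast path and never touch the regex, which is the point of checking first.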

The solution is this:

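<cfscript>
// Escape everything first: & < > " all become entities.
ct=HTMLEditFormat(cfhttp.filecontent, -1);
// Turn the escaped angle brackets back into real tag delimiters.
ct=Replace(ct, "&lt;", "<", "ALL");
ct=Replace(ct, "&gt;", ">", "ALL");
// Restore quotes inside tags. Each pass fixes one &quot; per tag,
// so repeat until nothing changes.
newct="";
while (ct neq newct){
   newct=ct;
   ct=REReplaceNoCase(ct, "(<[^>&]*)&quot;([^>]*>)", "\1""\2", "ALL");
}
// Un-double entities that were already escaped in the source (&amp;amp; -> &amp;).
ct=REReplace(ct, "&amp;([a-zA-Z]+);", "&\1;", "ALL");
</cfscript>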


And it works like a charm :)

Jesse

