Hi Jens,

Many thanks again for the help. Sorry I wasn't clearer about what I meant when 
I said "invalid UTF8" - I was using it in the context of XML, but should have 
made that more explicit. Also, I owe you an apology as I had completely missed 
NSMutableCharacterSet's -addCharactersInRange:, which as you say is exactly 
what I needed; I should have re-checked the NSCharacterSet methods after your 
first reply.

So, I hope I have a solution. I use NSMutableCharacterSet to create a character 
set containing all valid XML Unicode characters, invert it to get the set of 
invalid characters, then search the string for those characters and delete 
them. My NSString category method is below:

- (NSString *)validXMLString
{
        // Not all Unicode characters are valid in XML.
        // See:
        // http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
        // (Also see:
        // http://cse-mjmcl.cse.bris.ac.uk/blog/2007/02/14/1171465494443.html )
        //
        // The ranges of Unicode characters allowed, as specified above, are:
        // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
        // [#x10000-#x10FFFF] /* any Unicode character, excluding the
        // surrogate blocks, FFFE, and FFFF. */
        //
        // To ensure the string is valid for XML encoding, we therefore need to
        // remove any characters that do not fall within the above ranges.

        // First create a character set containing all invalid XML characters.
        // Create this once and leave it in memory so that we can reuse it
        // rather than recreate it every time we need it.
        static NSCharacterSet *invalidXMLCharacterSet = nil;
        
        if (invalidXMLCharacterSet == nil)
        {
                // First, create a character set containing all valid XML
                // characters. NSMakeRange takes a location and a length, so
                // each inclusive range [first, last] needs last - first + 1.
                NSMutableCharacterSet *XMLCharacterSet =
                        [[NSMutableCharacterSet alloc] init];
                [XMLCharacterSet addCharactersInRange:NSMakeRange(0x9, 1)];
                [XMLCharacterSet addCharactersInRange:NSMakeRange(0xA, 1)];
                [XMLCharacterSet addCharactersInRange:NSMakeRange(0xD, 1)];
                [XMLCharacterSet addCharactersInRange:
                        NSMakeRange(0x20, 0xD7FF - 0x20 + 1)];
                [XMLCharacterSet addCharactersInRange:
                        NSMakeRange(0xE000, 0xFFFD - 0xE000 + 1)];
                [XMLCharacterSet addCharactersInRange:
                        NSMakeRange(0x10000, 0x10FFFF - 0x10000 + 1)];
                
                // Then create and retain an inverted set, which will thus
                // contain all invalid XML characters.
                invalidXMLCharacterSet = [[XMLCharacterSet invertedSet] retain];
                [XMLCharacterSet release];
        }
        
        // Are there any invalid characters in this string?
        NSRange range = [self rangeOfCharacterFromSet:invalidXMLCharacterSet];
        
        // If not, just return self unaltered.
        if (range.length == 0)
                return self;
        
        // Otherwise go through and remove any illegal XML characters from a
        // copy of the string.
        NSMutableString *cleanedString = [self mutableCopy];
        
        while (range.length > 0)
        {
                [cleanedString deleteCharactersInRange:range];
                range = [cleanedString
                        rangeOfCharacterFromSet:invalidXMLCharacterSet];
        }
        
        return (NSString *)[cleanedString autorelease];
}

As the invalid character set is only created once, and as nothing is done if 
the string has no invalid XML characters, this seems to run pretty fast and do 
what I need.
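
For anyone who wants to sanity-check the ranges outside of NSCharacterSet, here 
is a minimal sketch of the same Char production as a plain C predicate (plain C 
also compiles as Objective-C). The names xml_char_is_valid, inclusive_length 
and xml_strip_invalid are hypothetical helpers made up for illustration, not 
part of Foundation, and the filter assumes the text has already been decoded 
to full code points rather than UTF-16 units. It also spells out the length 
arithmetic that NSMakeRange expects for an inclusive range.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the code point matches the XML 1.0 rule
   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
          | [#x10000-#x10FFFF] */
static bool xml_char_is_valid(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Hypothetical helper: the length to pass to an NSMakeRange-style
   (location, length) pair so that both first and last are included. */
static uint32_t inclusive_length(uint32_t first, uint32_t last)
{
    return last - first + 1;
}

/* Hypothetical helper: remove invalid code points in a single pass.
   Compacts the array in place and returns the new element count. */
static size_t xml_strip_invalid(uint32_t *cps, size_t count)
{
    size_t out = 0;
    for (size_t i = 0; i < count; i++) {
        if (xml_char_is_valid(cps[i]))
            cps[out++] = cps[i];
    }
    return out;
}
```

Calling xml_strip_invalid on a buffer copies each valid code point down over 
the gaps, so the whole buffer is visited only once, whereas the loop in my 
category method restarts its search from the beginning of the string after 
every deletion. Whether that matters in practice would depend on how many 
invalid characters a string typically contains.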

Many thanks again!
All the best,
Keith

--- On Fri, 1/29/10, Jens Alfke <[email protected]> wrote:

> From: Jens Alfke <[email protected]>
> Subject: Re: NSXML and invalid UTF8 characters
> To: "Keith Blount" <[email protected]>
> Cc: [email protected]
> Date: Friday, January 29, 2010, 3:23 AM
> 
> On Jan 28, 2010, at 3:47 PM, Keith Blount wrote:
> 
> > Many thanks for your reply. Wouldn't using these
> methods be a lot more expensive (and slower) than going
> through using -characterAtIndex: or something similar,
> accessing the characters directly, though?
> 
> No, because it's more efficient to let NSString itself do
> the searching, avoiding the overhead of a message-send per
> character.
> 
> > I'm thinking that I would have to add every character
> to the character set and then let NSString deal with all the
> underlying character stuff this way, whereas if I could
> check the unicode char is within a range then it would be
> faster.
> 
> You can easily create an NSCharacterSet on any range of
> Unicode values.
> 
> BTW, it's inaccurate to say "invalid UTF-8". UTF-8 is just
> an encoding of Unicode. You're talking about Unicode
> characters that are illegal in XML. (I bring this up because
> there is such a thing as invalid UTF-8, i.e. byte sequences
> that are invalid in UTF-8 encoding, but it's an entirely
> different issue; this confused me when I first read your
> message.)
> 
> —Jens
> 
> 



_______________________________________________

Cocoa-dev mailing list ([email protected])
