On 21/5/2009, at 20:51, Wil Macaulay wrote:

> Sorry - I was inaccurate in my previous reply (should have refreshed
> my memory first by looking at the code). On the Mac, the native
> encoding is Unicode - that's the conceptual basis for the NSString
> class. There are convenience functions for accessing the underlying
> character buffer as unichar - 16 bits unsigned. So my first step is
> to convert the raw file to an NSString as Unicode, then access the
> character buffer and send that to my parser. This requires my ragel
> file to use:
>
> #UniChar type is 16 bits unsigned
>
> alphtype unsigned short;
>
> Keywords all fall into the standard ASCII charset - anything that is
> outside the ASCII character set, for me, is only interesting in the
> context of literals (quoted strings and the like). This means that I
> can write my FSM in the normal fashion.

As far as I know, the native encoding for NSString on Mac OS X is UTF-16, which means that the approach you describe will work for most input, but will fall down for any code points that require surrogates. Not every code point fits in 16 bits: those above U+FFFF need a second 16-bit code unit, and the two units together form a surrogate pair.

The approach would work fine if the input were in UCS-2, which always fits in 16 bits but can't represent all Unicode code points.

So I guess it all depends on the kind of input the original poster is expecting. If it's user-supplied (untrusted) input and he wants to work with UTF-16, then he should probably handle surrogates gracefully, even if he isn't expecting them.

This Wikipedia article explains all this in a lot more detail:

http://en.wikipedia.org/wiki/UTF-16

Wincent
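
For what it's worth, here is a rough sketch in C of what handling surrogates could look like when walking a buffer of 16-bit code units. The helper names (is_high_surrogate, is_low_surrogate, next_code_point) are made up for illustration and are not part of Ragel or Cocoa, and substituting U+FFFD for an unpaired surrogate is just one possible policy, not necessarily the right one for every parser.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of Ragel or Cocoa. */

/* High (leading) surrogates occupy 0xD800..0xDBFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Low (trailing) surrogates occupy 0xDC00..0xDFFF. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Decode the code point starting at buf[*pos] and advance *pos past it.
 * The caller must ensure *pos < len.
 * An unpaired surrogate consumes one code unit and is reported as
 * U+FFFD (an assumed policy; a parser could also reject the input).
 */
static uint32_t next_code_point(const uint16_t *buf, size_t len, size_t *pos)
{
    uint16_t u = buf[(*pos)++];

    if (is_high_surrogate(u)) {
        if (*pos < len && is_low_surrogate(buf[*pos])) {
            uint16_t lo = buf[(*pos)++];
            /* Combine the two 10-bit halves and add the 0x10000 offset. */
            return 0x10000u + (((uint32_t)(u - 0xD800) << 10)
                               | (uint32_t)(lo - 0xDC00));
        }
        return 0xFFFDu; /* high surrogate with no low surrogate after it */
    }
    if (is_low_surrogate(u)) {
        return 0xFFFDu; /* low surrogate with no high surrogate before it */
    }
    return u; /* ordinary BMP code point, one code unit */
}
```

With a 16-bit alphtype, a Ragel machine still sees the individual code units, so a check along these lines would sit either in a pre-pass over the buffer or in the actions that handle literals.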