You need to either write a UTF-8 stream, or convert to UTF-32 and write a stream for that instead. The conversion routines are already in the runtime. I have been meaning to get around to providing a universal input stream, but the input streams are easy to code: all you need to do is copy one of the others and change the pointer calculations. Dealing with UTF-8 directly is trickier because the characters are variable width, so (in general) you must scan the bytes all the time to find a character position. The decoding itself costs little, but the extra algorithmic complexity comes with its own overhead. Easiest to just convert on the fly, I think, unless your input files are massive.
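To illustrate the "convert on the fly" route, here is a minimal standalone sketch of a UTF-8 to UTF-32 decode. It is not the runtime's own conversion routine; the function name utf8_to_utf32 and its signature are made up for this example, and a real input stream would feed the resulting 32-bit buffer to a stream copied from one of the existing ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the ANTLR C runtime.
 * Decodes the UTF-8 bytes in src[0..srcLen) into UTF-32 code points in dst.
 * Returns the number of code points written, or (size_t)-1 if the input is
 * malformed or dst (capacity dstCap elements) is too small. */
size_t utf8_to_utf32(const uint8_t *src, size_t srcLen,
                     uint32_t *dst, size_t dstCap)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes;
     * anything below this for its length is an overlong encoding. */
    static const uint32_t minForLen[4] = { 0x0, 0x80, 0x800, 0x10000 };
    size_t i = 0;   /* read position in src  */
    size_t n = 0;   /* write position in dst */

    while (i < srcLen) {
        uint8_t  lead = src[i];
        uint32_t cp;
        size_t   extra;   /* number of continuation bytes expected */
        size_t   k;

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return (size_t)-1;          /* stray continuation or bad lead */

        if (extra > srcLen - i - 1)
            return (size_t)-1;           /* sequence truncated at end of input */

        for (k = 1; k <= extra; k++) {
            uint8_t c = src[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;       /* not a continuation byte */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }

        /* Reject overlong forms, UTF-16 surrogates and values past U+10FFFF. */
        if (cp < minForLen[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;

        if (n >= dstCap)
            return (size_t)-1;           /* output buffer too small */

        dst[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

Since no UTF-8 sequence decodes to more code points than it has bytes, a destination buffer of srcLen elements is always big enough. That is the memory cost of this approach: up to four bytes per input byte for pure ASCII input.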
8bit refers to just that: it makes no assumptions about encoding, it just deals with 8-bit-wide input elements. Similarly, the UCS2 stream isn't really UCS2, it just deals with 16-bit-wide input elements. Generally the string factory stuff is just for convenience, and there are usually better ways to handle strings that are specific to your application, but again you can copy the methods and implement them for whatever encoding you need.

Jim

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
> Sent: Saturday, December 05, 2009 9:46 AM
> To: [email protected]
> Subject: [antlr-dev] UTF8 file/input stream? (C Runtime)
>
> Hey guys, I was wondering if anyone out there has a patch that implements
> a UTF8 stream already, or if I need to write one myself.
>
> In the latter case, I am a bit confused about the structure of the stream
> objects.
>
> The UCS2 input stream for instance sets the character width to 2, and can
> of course use the ascii 8bit file stream directly without needing to
> decode.
>
> I was wondering if the character width variable is actually used anywhere,
> or if I can safely set it to 0 and decode UTF8 on the fly.
>
> The alternative, which is suggested in the header, is to decode the UTF8
> to UCS4 when reading the file, and use a UCS4 stream. This is easily
> doable, but it seems like a waste of memory. (decoding UTF8 is
> NEGLIGIBLE!! performance wise, these days).
>
> At any rate, if I am using UCS4 internally, the other issue is the string
> factory class. I haven't looked much at it yet, but the uses of the
> various methods in the string factory interface are confusing. It looks
> like I need to implement a [my encoding] to [my encoding] function, a
> '8bit' to [my encoding] and so on. I assume that '8bit' refers to ascii,
> but it's not really clear. So I was just wondering if I could get some
> clarification on exactly what these interfaces need to do, so that I can
> implement one properly.
>
> On the other hand, as mentioned at the beginning of this post, if a UTF8
> stream has already been implemented, all I would like is a link to it!
>
> Cheers.
>
> _______________________________________________
> antlr-dev mailing list
> [email protected]
> http://www.antlr.org/mailman/listinfo/antlr-dev
