On Tue, Nov 27, 2012 at 3:01 PM, Richard Smith <rich...@metafoo.co.uk> wrote:
> On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <eli.fried...@gmail.com>
> wrote:
>>
>> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <rich...@metafoo.co.uk>
>> wrote:
>> > I had a look at supporting UTF-8 in source files, and came up with
>> > the attached approach. getCharAndSize maps UTF-8 characters down to
>> > a char with the high bit set, representing the class of the character
>> > rather than the character itself. (I've not done any performance
>> > measurements yet, and the patch is generally far from being ready
>> > for review.)
>> >
>> > Have you considered using a similar approach for lexing UCNs? We
>> > already land in getCharAndSizeSlow, so it seems like it'd be cheap
>> > to deal with them there. Also, validating the codepoints early would
>> > allow us to recover better (for instance, from UCNs encoding
>> > whitespace or elements of the basic source character set).
>>
>> That would affect the spelling of the tokens, and I don't think the C
>> or C++ standard actually allows us to do that.
>
> If I understand you correctly, you're concerned that we would get the
> wrong string in the token's spelling? When we build a token, we take
> the characters from the underlying source buffer, not the value
> returned by getCharAndSize.
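For concreteness, the mapping Richard describes might look something like
the following standalone sketch (the names, markers, and classification
logic here are invented for illustration; this is not the actual patch):

#include <cstdint>

// Hypothetical class markers; the high bit is set so they can never
// collide with a plain ASCII character returned on the fast path.
enum {
  kClassIdentifier = 0x80, // codepoint may appear in an identifier
  kClassWhitespace = 0x81, // codepoint is Unicode whitespace
  kClassOther      = 0x82  // anything else
};

// getCharAndSize-style helper: returns either a plain ASCII character or
// a class marker, and reports via Size how many bytes were consumed.
inline char getCharAndSizeSketch(const char *Ptr, unsigned &Size) {
  unsigned char C = (unsigned char)Ptr[0];
  if (C < 0x80) { Size = 1; return (char)C; } // ASCII fast path

  // Decode the UTF-8 sequence (validation omitted for brevity).
  uint32_t CP;
  if (C < 0xE0)      { Size = 2; CP = C & 0x1F; }
  else if (C < 0xF0) { Size = 3; CP = C & 0x0F; }
  else               { Size = 4; CP = C & 0x07; }
  for (unsigned I = 1; I != Size; ++I)
    CP = (CP << 6) | ((unsigned char)Ptr[I] & 0x3F);

  // Classify the codepoint. A real patch would consult Unicode tables
  // (e.g. XID_Start/XID_Continue); crude stand-ins are used here.
  if (CP == 0x00A0 || (CP >= 0x2000 && CP <= 0x200A))
    return (char)kClassWhitespace; // NO-BREAK SPACE, EN QUAD..HAIR SPACE
  if (CP >= 0x00C0)                // "letter-ish" codepoints
    return (char)kClassIdentifier;
  return (char)kClassOther;
}

Feeding it the bytes 0xC3 0xA9 (U+00E9) yields kClassIdentifier with
Size == 2; the lexer's hot switch can then dispatch on the marker exactly
as it would on an ASCII character, and, as Richard notes above, the
token's spelling is unaffected because it is built from the underlying
source buffer, not from this return value.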
Oh, I see... so the idea is to hack up getCharAndSize so that, instead of
calling isUCNAfterSlash/ConsumeUCNAfterSlash at the places where we expect
a UCN, it hands back a marker which essentially means "saw a UCN". That
seems like a workable approach; I don't think it actually helps with error
recovery (I'm pretty sure we can't diagnose anything without knowing what
kind of token we're forming), but I think it will make the patch simpler.
I'll try to hack up a new version of my patch.

-Eli
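A rough sketch of that marker idea, under the same scheme as above (again,
the names are invented for illustration, and a real patch would also need
to validate the resulting codepoint and cope with escaped newlines inside
the UCN):

// Hypothetical marker meaning "a well-formed UCN starts here"; like the
// class markers in the earlier sketch, the high bit keeps it out of the
// ASCII range.
enum { kSawUCN = 0x83 };

inline bool isHexDigitSketch(char C) {
  return (C >= '0' && C <= '9') || (C >= 'a' && C <= 'f') ||
         (C >= 'A' && C <= 'F');
}

// Called from the getCharAndSize slow path when Ptr[0] == '\\'. Assumes a
// NUL-terminated buffer (as Clang's source buffers are), so the digit loop
// stops safely at end of input. If the backslash begins a well-formed UCN,
// Size covers the whole escape and the marker is returned; otherwise the
// backslash is returned unchanged.
inline char mapUCNSketch(const char *Ptr, unsigned &Size) {
  unsigned NumDigits = Ptr[1] == 'u' ? 4u : Ptr[1] == 'U' ? 8u : 0u;
  if (NumDigits == 0) { Size = 1; return '\\'; }
  for (unsigned I = 0; I != NumDigits; ++I)
    if (!isHexDigitSketch(Ptr[2 + I])) { Size = 1; return '\\'; }
  Size = 2 + NumDigits; // "\uXXXX" or "\UXXXXXXXX"
  return (char)kSawUCN;
}

The lexer then treats kSawUCN like any other character class in its
dispatch, and since the token text is still copied from the source buffer,
the spelling concern discussed above does not arise.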