On Tue, Nov 27, 2012 at 5:04 PM, Eli Friedman <eli.fried...@gmail.com> wrote:
> On Tue, Nov 27, 2012 at 3:33 PM, Eli Friedman <eli.fried...@gmail.com> wrote:
>> On Tue, Nov 27, 2012 at 3:01 PM, Richard Smith <rich...@metafoo.co.uk> wrote:
>>> On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <eli.fried...@gmail.com> wrote:
>>>>
>>>> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <rich...@metafoo.co.uk> wrote:
>>>> > I had a look at supporting UTF-8 in source files, and came up with the
>>>> > attached approach. getCharAndSize maps UTF-8 characters down to a char
>>>> > with the high bit set, representing the class of the character rather
>>>> > than the character itself. (I've not done any performance measurements
>>>> > yet, and the patch is generally far from being ready for review).
>>>> >
>>>> > Have you considered using a similar approach for lexing UCNs? We already
>>>> > land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
>>>> > them there. Also, validating the codepoints early would allow us to
>>>> > recover better (for instance, from UCNs encoding whitespace or elements
>>>> > of the basic source character set).
>>>>
>>>> That would affect the spelling of the tokens, and I don't think the C
>>>> or C++ standard actually allows us to do that.
>>>
>>> If I understand you correctly, you're concerned that we would get the wrong
>>> string in the token's spelling? When we build a token, we take the
>>> characters from the underlying source buffer, not the value returned by
>>> getCharAndSize.
>>
>> Oh, I see... so the idea is to hack up getCharAndSize instead of
>> calling isUCNAfterSlash/ConsumeUCNAfterSlash where we expect a UCN,
>> use a marker which essentially means "saw a UCN".
>>
>> Seems like a workable approach; I don't think it actually helps any
>> with error recovery (I'm pretty sure we can't diagnose anything
>> without knowing what kind of token we're forming), but I think it will
>> make the patch simpler. I'll try to hack up a new version of my
>> patch.
>
> Attached.
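For readers following the thread, the marker idea quoted above can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the attached patch: the getCharAndSize here is a simplified stand-in that takes a buffer and a position, and UCNMarker, isUCNAt and lexIdentifier are hypothetical names invented for the example. It ignores everything else the real slow path has to deal with and does no codepoint validation. The point it shows is that the lexer sees a single "saw a UCN" value with the high bit set, while the token's spelling is still copied from the underlying source buffer.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical marker meaning "saw a UCN". The high bit keeps it distinct
// from every character of the basic source character set.
constexpr unsigned char UCNMarker = 0x80;

// Returns true if Buf[Pos..] starts with \uXXXX or \UXXXXXXXX and stores the
// number of source characters it occupies in Size. Size is untouched on failure.
bool isUCNAt(std::string_view Buf, std::size_t Pos, std::size_t &Size) {
  if (Pos + 1 >= Buf.size() || Buf[Pos] != '\\')
    return false;
  std::size_t Digits;
  if (Buf[Pos + 1] == 'u')
    Digits = 4;
  else if (Buf[Pos + 1] == 'U')
    Digits = 8;
  else
    return false;
  if (Pos + 2 + Digits > Buf.size())
    return false;
  for (std::size_t I = 0; I < Digits; ++I)
    if (!std::isxdigit(static_cast<unsigned char>(Buf[Pos + 2 + I])))
      return false;
  Size = 2 + Digits;
  return true;
}

// Simplified stand-in for getCharAndSize: a UCN is reported as UCNMarker with
// Size covering the whole escape; any other character is returned as-is.
unsigned char getCharAndSize(std::string_view Buf, std::size_t Pos,
                             std::size_t &Size) {
  assert(Pos < Buf.size() && "reading past the end of the buffer");
  if (isUCNAt(Buf, Pos, Size))
    return UCNMarker;
  Size = 1;
  return static_cast<unsigned char>(Buf[Pos]);
}

// Consumes identifier characters starting at Pos (the caller is assumed to
// have already decided an identifier starts there). The spelling is copied
// from the buffer, not rebuilt from the values getCharAndSize returned.
std::string lexIdentifier(std::string_view Buf, std::size_t Pos) {
  const std::size_t Start = Pos;
  while (Pos < Buf.size()) {
    std::size_t Size = 0;
    unsigned char C = getCharAndSize(Buf, Pos, Size);
    bool IsIdentChar = C == UCNMarker || C == '_' || std::isalnum(C);
    if (!IsIdentChar)
      break;
    Pos += Size;
  }
  return std::string(Buf.substr(Start, Pos - Start));
}
```

With this sketch, lexing an identifier at the start of a buffer holding caf\u00E9 + 1 consumes the six-character escape as one marker, and the returned spelling still contains the escape exactly as written in the source.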
I've also discovered a rather large weakness of this approach: actually
writing a correct implementation of getCharAndSizeSlow that returns a
special value for UCNs is painful at best. I might have to abandon
this route.

-Eli
_______________________________________________
cfe-commits mailing list
cfe-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits