How about this approach?
- LexUnicode mirrors LexTokenInternal, dispatching to the proper lex method
based on the first Unicode character in a token (roughly the shape sketched
after this list).
- UCNs are validated in readUCN (called by LexTokenInternal and LexIdentifier). 
The specific identifier restrictions are checked in LexUnicode and 
LexIdentifier.
- UCNs are recomputed in Preprocessor::LookUpIdentifierInfo, because that's
where we start from the token's spelling, but by that point all the
validation has already happened.
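To make the dispatch concrete, here's roughly its shape. This is a
simplified sketch, not the patch itself: isAllowedInitiallyIDChar and
isUnicodeWhitespace stand in for the real classification helpers (sketched
further down), and the diagnostic IDs are made up.

void Lexer::LexUnicode(Token &Result, uint32_t CodePoint, const char *CurPtr) {
  if (isAllowedInitiallyIDChar(CodePoint)) {
    // The code point can start an identifier; lex the rest of it.
    return LexIdentifier(Result, CurPtr);
  }

  if (isUnicodeWhitespace(CodePoint)) {
    // Diagnose (eventually with a fix-it), then treat it like ordinary
    // whitespace and go around again for the next token.
    Diag(BufferPtr, diag::warn_unicode_whitespace); // hypothetical diag ID
    BufferPtr = CurPtr;
    return LexTokenInternal(Result);
  }

  // Anything else doesn't form a token: diagnose and hand back
  // tok::unknown so the caller can recover.
  Diag(BufferPtr, diag::err_non_ascii_char); // hypothetical diag ID
  FormTokenWithChars(Result, CurPtr, tok::unknown);
}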

With these known flaws:
- the classification of characters in LexUnicode should be more efficient
(one possible direction is sketched after this list).
- poor recovery for a non-identifier UCN in an identifier. Right now I just
take that to mean "end of identifier", which is the most pedantically correct
interpretation, but it's probably not what the user meant.
- still needs more tests, of course
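On the first flaw: the obvious fix is the standard one, keeping the allowed
code points as a sorted table of inclusive ranges and binary-searching it
instead of running a chain of comparisons. A minimal sketch (table
abbreviated; the first entries are from the C11 Annex D.1 ranges):

#include <cstddef>
#include <cstdint>

// A sorted, non-overlapping set of inclusive code point ranges.
struct UnicodeCharRange {
  uint32_t Lower, Upper;
};

// First few entries of the C11 Annex D.1 "allowed in identifiers" set;
// the full table has many more ranges.
static const UnicodeCharRange C11AllowedIDChars[] = {
  {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD},
  {0x00AF, 0x00AF}, {0x00B2, 0x00B5}, {0x00B7, 0x00BA},
  // ... remaining Annex D.1 ranges ...
};

// O(log n) membership test.
static bool isCharInSet(uint32_t C, const UnicodeCharRange *Ranges,
                        size_t NumRanges) {
  size_t Lo = 0, Hi = NumRanges;
  while (Lo < Hi) {
    size_t Mid = Lo + (Hi - Lo) / 2;
    if (C < Ranges[Mid].Lower)
      Hi = Mid;
    else if (C > Ranges[Mid].Upper)
      Lo = Mid + 1;
    else
      return true;
  }
  return false;
}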

FWIW, though, I'm not sure unifying literal Unicode and UCNs is actually a 
great idea. The case where it matters most (validation of identifier 
characters) is pretty easy to separate out into a helper function (and indeed 
it already is; a sketch follows below). The other cases (accepting Unicode
whitespace and fix-its for accidental Unicode) only make sense for literal
Unicode, not escaped Unicode.
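To be concrete about the helper: all that really needs sharing between the
literal path (LexUnicode) and the escaped path (readUCN/LexIdentifier) is
the classification itself. Reusing the isCharInSet sketch above, plus a
D.2-style "not allowed initially" table:

// C11 Annex D.2: characters disallowed as the first identifier character.
static const UnicodeCharRange C11DisallowedInitialIDChars[] = {
  {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F}
};

static bool isAllowedIDChar(uint32_t C) {
  return isCharInSet(C, C11AllowedIDChars,
                     sizeof(C11AllowedIDChars) / sizeof(C11AllowedIDChars[0]));
}

static bool isAllowedInitiallyIDChar(uint32_t C) {
  // Allowed anywhere in an identifier, and not in the
  // disallowed-initially set.
  return isAllowedIDChar(C) &&
         !isCharInSet(C, C11DisallowedInitialIDChars,
                      sizeof(C11DisallowedInitialIDChars) /
                          sizeof(C11DisallowedInitialIDChars[0]));
}

Both lexing paths can call these once they have a decoded code point,
whether it came from UTF-8 bytes or from a UCN.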

Anyway, what do you think?
Jordan

Attachment: UCNs.patch
