Hans Aberg <[email protected]> writes:

> On 1 Jan 2012, at 21:06, David Kastrup wrote:
>
>>>> Updates:
>>>> 	Labels: Patch-new
>>>>
>>>> Comment #2 on issue 2159 by [email protected]: Patch: lexer.ll: Warn
>>>> about non-UTF-8 characters
>>>> http://code.google.com/p/lilypond/issues/detail?id=2159#c2
>>>>
>>>> lexer.ll: Warn about non-UTF-8 characters
>>>>
>>>> Making the warnings point to the exact bad byte rather than the
>>>> enclosing construct would be nice.
>>>
>>> One way to implement this might be to use the Haskell program for Flex
>>> like UTF-8 regular expressions I made:
>>> http://xcybercloud.blogspot.com/2009/04/unicode-support-in-flex.html
>>>
>>> First make rules for the Unicode characters you want admit, followed
>>> by a '.' rule which picks up single excluded bytes.
>>
>> The "unicode characters we want admit" are not single characters, but
>> part of things like identifiers, strings and other stuff.  Cf.
>> <URL:http://codereview.appspot.com/5505090#msg5>
>> for a reasoning about the current approach for this patch.
>
> I translate Unicode character classes into Flex UTF-8 regular
> expressions, so you can apply the other Flex regex operators to get
> that stuff.

What makes you think I did not get that?  Did you actually _read_ the
reasoning I linked to above?  You don't get a single error path in that
case, and doing a catchall with . requires _backing_ _up_ in the lexer
for every non-UTF-8 byte sequence that does not already start with an
invalid byte.

We use uncompressed tables in the lexer and make it a point to have _no_
expressions backing up.  So you need to provide expressions matching any
_bad_ UTF-8 sequence even if its first bytes are identical to those of a
good UTF-8 sequence.

Please try understanding this problem before suggesting a non-fitting
solution again.  I have spent days doing analysis and trying alternative
approaches, and it is somewhat aggravating if somebody just goes on
assuming that I don't know what I am talking about and that showing me a
simplistic solution often enough will make me realize my stupidity.

Please run lex -b on a flex file of yours that checks for UTF-8
identifiers and check whether you get any backup states in the resulting
lex.backup file.  I should be quite surprised if you didn't.

-- 
David Kastrup

_______________________________________________
bug-lilypond mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/bug-lilypond
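
The requirement stated in the reply above (every _bad_ sequence needs a
match of its own, even when it begins like a good one) can be sketched
outside of flex.  The following is only an illustration, not code from
the patch: the names scan and second_byte_range are made up here, and
the byte ranges are the standard well-formed UTF-8 ranges.  The scanner
makes one forward pass and never re-reads a byte it has consumed; a
truncated prefix such as E2 82 is emitted as its own "bad" token
instead of being abandoned in favour of a one-byte catchall.

```python
def second_byte_range(lead):
    """Return (low, high, length) for a multibyte lead byte, else None.

    low/high bound the *second* byte; later bytes are always 0x80-0xBF.
    """
    if 0xC2 <= lead <= 0xDF:
        return (0x80, 0xBF, 2)
    if lead == 0xE0:
        return (0xA0, 0xBF, 3)   # excludes overlong 3-byte forms
    if lead == 0xED:
        return (0x80, 0x9F, 3)   # excludes UTF-16 surrogates
    if 0xE1 <= lead <= 0xEF:
        return (0x80, 0xBF, 3)
    if lead == 0xF0:
        return (0x90, 0xBF, 4)   # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF, 4)
    if lead == 0xF4:
        return (0x80, 0x8F, 4)   # stays at or below U+10FFFF
    return None                  # 0x80-0xC1 and 0xF5-0xFF never lead


def scan(data):
    """Split bytes into ("ok", seq) and ("bad", seq) tokens, no backing up."""
    tokens = []
    i, n = 0, len(data)
    while i < n:
        start = i
        lead = data[i]
        i += 1
        if lead < 0x80:                      # plain ASCII
            tokens.append(("ok", data[start:i]))
            continue
        spec = second_byte_range(lead)
        if spec is None:                     # byte that can never start a sequence
            tokens.append(("bad", data[start:i]))
            continue
        low, high, length = spec
        ok = True
        while i - start < length:
            # Look at the next byte; consume it only if it continues the sequence.
            if i >= n or not (low <= data[i] <= high):
                ok = False                   # the prefix read so far is the bad token
                break
            i += 1
            low, high = 0x80, 0xBF
        tokens.append(("ok" if ok else "bad", data[start:i]))
    return tokens
```

In flex terms, each "bad" prefix here would correspond to an explicit
error rule, so that every state reached while reading a multibyte
sequence is an accepting one and lex.backup stays empty.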
