afs commented on issue #1324: URL: https://github.com/apache/jena/issues/1324#issuecomment-1132126541
I don't see a PR on the JavaCC issue that is suitable. There is an interesting suggestion about lexical states.

ARQ only parses from strings, not streams, and only from data already converted from UTF-8. Access to the input would enable slicing literals directly out of the string. Rather than disrupt the existing processing, it could be done with a new token, e.g. `X"...."`. USER_CHAR_STREAM is also an option.

There is some investigation to do, such as updating for JavaCC 7.0 (the Jena codebase files were produced from JavaCC 6.0). See #1328.

FYI: The different parsers use different techniques to handle Unicode, and this shows up in some tests about surrogate pairs.
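
To illustrate the idea of slicing a literal directly out of the input string, here is a minimal sketch. It is not ARQ or JavaCC code. The class `RawLiteralSlice` and the helper `sliceRawLiteral` are hypothetical names, and the sketch assumes that an `X"...."` token would run to the next double quote with no escape processing, which the comment does not specify.

```java
public class RawLiteralSlice {

    /**
     * Hypothetical helper: given the offset of an X" marker, return the
     * text up to the next double quote, copied straight from the input.
     * Assumption: no escape processing inside the literal.
     */
    static String sliceRawLiteral(String input, int start) {
        if (!input.startsWith("X\"", start)) {
            throw new IllegalArgumentException("No X\" token at offset " + start);
        }
        int begin = start + 2;               // skip the X" marker
        int end = input.indexOf('"', begin); // closing quote
        if (end < 0) {
            throw new IllegalArgumentException("Unterminated literal");
        }
        // substring works on UTF-16 code units, so a surrogate pair
        // inside the literal is copied intact.
        return input.substring(begin, end);
    }

    public static void main(String[] args) {
        // The literal holds a backslash, the letter n, and one character
        // outside the BMP (encoded as a surrogate pair).
        String query = "SELECT * { ?s ?p X\"a\\nb \uD83D\uDE00\" }";
        String lit = sliceRawLiteral(query, query.indexOf("X\""));
        System.out.println(lit);
        System.out.println(lit.length() + " code units, "
                + lit.codePointCount(0, lit.length()) + " code points");
    }
}
```

Because the slice is taken by offset, the backslash sequence stays as two characters and the surrogate pair counts as two code units but one code point, which is the kind of difference the surrogate-pair tests would exercise.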
