afs commented on issue #2766: URL: https://github.com/apache/jena/issues/2766#issuecomment-2409042272
Hi @mcb5637 - thank you for the report.

Because Java's bytes-to-string conversion uses the JDK standard charset decoder, Jena can't distinguish between multibyte characters that were translated to surrogates (which is legal) and surrogates actually present in the UTF-8 (which is illegal - UTF-8 does not allow surrogates).

As this would be silently legal in a string, #2769 removes the warning (it is only a warning - the code does insert the character into the IRI).

A deeper solution is to use a UTF-8 processor that generates ints, not chars (it is the chars that need surrogates). But that is delicate. When it was written, the tokenizer was faster using the JDK built-in conversion; that may not be the case nowadays (Java improvements, CPU architectures). But it is not a change to be made lightly.
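
To illustrate the int-based approach, here is a minimal sketch of a decoder that turns UTF-8 bytes directly into code points and rejects surrogates found in the byte stream. The class `Utf8CodePoints` and its `decode` method are hypothetical names for illustration only; they are not Jena code and say nothing about how Jena's tokenizer is structured or how such a change would perform.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Jena: decodes UTF-8 bytes straight to int code points.
final class Utf8CodePoints {

    /** Decode UTF-8 into code points; throws on malformed input, including encoded surrogates. */
    static int[] decode(byte[] in) {
        int[] out = new int[in.length];   // at most one code point per input byte
        int n = 0;
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            int len;   // total bytes in this sequence
            int cp;    // code point accumulated so far
            int min;   // smallest code point allowed for this length (rejects overlong forms)
            if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
            else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
            else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
            else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
            else throw new IllegalArgumentException("Bad lead byte at offset " + i);

            if (i + len > in.length)
                throw new IllegalArgumentException("Truncated sequence at offset " + i);
            for (int k = 1; k < len; k++) {
                int b = in[i + k] & 0xFF;
                if ((b & 0xC0) != 0x80)
                    throw new IllegalArgumentException("Bad continuation byte at offset " + (i + k));
                cp = (cp << 6) | (b & 0x3F);
            }
            if (cp < min || cp > 0x10FFFF)
                throw new IllegalArgumentException("Overlong or out-of-range sequence at offset " + i);
            // With ints, a surrogate can only come from the bytes themselves, so it is detectable.
            if (cp >= 0xD800 && cp <= 0xDFFF)
                throw new IllegalArgumentException("Surrogate encoded in UTF-8 at offset " + i);
            out[n++] = cp;
            i += len;
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        byte[] legal = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80}; // U+1F600 in 4 bytes
        String s = new String(legal, StandardCharsets.UTF_8);
        System.out.println(s.length());                              // 2 chars: a surrogate pair
        System.out.println(Integer.toHexString(decode(legal)[0]));   // 1f600: one int, no surrogates
    }
}
```

With chars, the legal 4-byte sequence above ends up as two surrogate chars in the string. With ints it stays a single code point, so any surrogate value the decoder sees (for example the bytes `ED A0 BD`) must have been in the input and can be reported as an error.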
