afs commented on issue #2766:
URL: https://github.com/apache/jena/issues/2766#issuecomment-2409042272

   Hi @mcb5637 - thank you for the report.
   
   Because Java's bytes-to-string conversion uses the JDK standard charset 
decoder, Jena can't distinguish between multibyte characters that were 
translated to surrogates (which is legal) and surrogates that are actually 
present in the UTF-8 input (which is illegal - UTF-8 does not allow surrogates).
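   
   To illustrate the legal case only: a minimal sketch, assuming nothing about Jena's internals, of how a 4-byte UTF-8 sequence ends up as a surrogate pair once it is decoded into a Java `String`. The class name `SurrogateDemo` and the helper `decode` are hypothetical and exist only for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Jena.
final class SurrogateDemo {

    // Decode UTF-8 bytes with the JDK's built-in charset support.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1F600 encoded as legal 4-byte UTF-8.
        byte[] bytes = { (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80 };
        String s = decode(bytes);

        // One code point, but two UTF-16 chars: a high and a low surrogate.
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.printf("char[0] = %04X, char[1] = %04X%n",
                (int) s.charAt(0), (int) s.charAt(1));
    }
}
```

   Code that only inspects the resulting chars sees surrogates here even though the input was valid UTF-8.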
   
   As this would be silently legal in a string, #2769 removes the warning (it 
is only a warning - the code does insert the character into the IRI).
   
   A deeper solution is to use a UTF-8 processor that generates ints, not chars 
(chars being what makes surrogates necessary). But that is delicate. When it was 
written, the tokenizer was faster using the JDK built-in conversion; that may not 
be the case nowadays (Java improvements, CPU architectures). But it is not a 
change to be made lightly.
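   
   As a rough sketch of what "a UTF-8 processor that generates ints" could look like (this is an illustration under assumptions, not Jena's tokenizer and not the proposed change): the hypothetical `Utf8ToCodePoints.decode` below turns bytes directly into code points, so a surrogate can only appear if it was literally encoded in the input, and that case is rejected.

```java
import java.util.Arrays;

// Hypothetical helper; not part of Jena.
final class Utf8ToCodePoints {

    // Decode UTF-8 bytes into Unicode code points (ints), never into chars.
    static int[] decode(byte[] in) {
        int[] out = new int[in.length];   // at most one code point per byte
        int n = 0;
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            int len, cp, min;             // sequence length, value, smallest legal value
            if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
            else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
            else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
            else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
            else throw new IllegalArgumentException("Bad lead byte at offset " + i);

            if (i + len > in.length)
                throw new IllegalArgumentException("Truncated sequence at offset " + i);

            for (int k = 1; k < len; k++) {
                int b = in[i + k] & 0xFF;
                if ((b & 0xC0) != 0x80)
                    throw new IllegalArgumentException("Bad continuation byte at offset " + (i + k));
                cp = (cp << 6) | (b & 0x3F);
            }

            if (cp < min)
                throw new IllegalArgumentException("Overlong encoding at offset " + i);
            if (cp > 0x10FFFF)
                throw new IllegalArgumentException("Code point out of range at offset " + i);
            // A surrogate seen here was really in the byte stream, which UTF-8 forbids.
            if (cp >= 0xD800 && cp <= 0xDFFF)
                throw new IllegalArgumentException("Encoded surrogate at offset " + i);

            out[n++] = cp;
            i += len;
        }
        return Arrays.copyOf(out, n);
    }
}
```

   Whether something like this can match the speed of the JDK's built-in conversion is exactly the open question above; the sketch makes no performance claim.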
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


