Re: [I] turtle command wrongly warns about invalid characters in an IRI containing 4-byte unicode [jena]

via GitHub Sun, 13 Oct 2024 09:32:09 -0700


afs commented on issue #2766:
URL: https://github.com/apache/jena/issues/2766#issuecomment-2409042272


   Hi @mcb5637 - thank you for the report.
   
   Because Java converts bytes to strings using the JDK standard charset 
decoder, Jena can't distinguish between multibyte characters translated to 
surrogates (which is legal) and surrogates actually present in the UTF-8 input 
(which is illegal - UTF-8 does not allow surrogates).
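
   To illustrate the legal case only, here is a minimal sketch, not Jena code. 
The class and method names are made up for this example. It decodes the 4-byte 
UTF-8 encoding of U+1F600 with the JDK decoder and shows that a check working 
on individual chars sees two surrogates even though the input is valid.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Jena.
public class SurrogateDemo {
    // Decode UTF-8 bytes with the JDK's built-in charset decoder.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Char-level check: true if any UTF-16 code unit is a surrogate.
    static boolean containsSurrogateChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // U+1F600 encoded as four UTF-8 bytes: F0 9F 98 80
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        String s = decode(utf8);
        System.out.println(s.length());                      // 2 chars (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(containsSurrogateChar(s));        // true
    }
}
```

   Once the bytes have become chars, the surrogate pair is all that a 
char-based tokenizer gets to look at.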
   
   As this would be silently legal in a string, #2769 removes the warning (it 
is only a warning - the code does insert the character into the IRI).
   
   A deeper solution is to use a UTF-8 processor that generates ints rather 
than chars (it is the chars that need surrogates). But that is delicate. When 
it was written, the tokenizer was faster using the JDK built-in conversion; 
that may not be the case nowadays (Java improvements, CPU architectures). But 
it is not a change to be made lightly.
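
   As a rough sketch of what "generates ints" could mean (an assumption for 
illustration, not Jena's tokenizer and not what #2769 does), a decoder can 
turn UTF-8 bytes directly into code points. At that level a surrogate encoded 
in the input is visible and can be rejected, while a 4-byte character is just 
one int. The class name `Utf8ToCodePoints` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch; not part of Jena.
public class Utf8ToCodePoints {
    // Decode UTF-8 bytes into Unicode code points (ints), never into chars.
    static int[] decode(byte[] in) {
        int[] out = new int[in.length]; // at most one code point per byte
        int n = 0;
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            int len;
            int cp;
            int min; // smallest code point allowed for this sequence length
            if (b0 < 0x80) {
                len = 1; cp = b0; min = 0;
            } else if ((b0 & 0xE0) == 0xC0) {
                len = 2; cp = b0 & 0x1F; min = 0x80;
            } else if ((b0 & 0xF0) == 0xE0) {
                len = 3; cp = b0 & 0x0F; min = 0x800;
            } else if ((b0 & 0xF8) == 0xF0) {
                len = 4; cp = b0 & 0x07; min = 0x10000;
            } else {
                throw new IllegalArgumentException("Bad lead byte at offset " + i);
            }
            if (i + len > in.length) {
                throw new IllegalArgumentException("Truncated sequence at offset " + i);
            }
            for (int k = 1; k < len; k++) {
                int b = in[i + k] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    throw new IllegalArgumentException("Bad continuation byte at offset " + (i + k));
                }
                cp = (cp << 6) | (b & 0x3F);
            }
            if (cp < min) {
                throw new IllegalArgumentException("Overlong encoding at offset " + i);
            }
            if (cp > 0x10FFFF) {
                throw new IllegalArgumentException("Code point out of range at offset " + i);
            }
            // Surrogates encoded in UTF-8 are illegal and detectable here.
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                throw new IllegalArgumentException("Surrogate in UTF-8 input at offset " + i);
            }
            out[n++] = cp;
            i += len;
        }
        return Arrays.copyOf(out, n);
    }
}
```

   With this sketch, the bytes F0 9F 98 80 decode to the single int 0x1F600, 
and the bytes ED A0 80 (an encoded U+D800) are rejected. Whether such a 
decoder would be fast enough for the tokenizer is the open question raised 
above.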
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


