[jira] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2024-01-23 Thread Joe Kesselman (Jira)


[ https://issues.apache.org/jira/browse/XALANJ-2419 ]


Joe Kesselman deleted comment on XALANJ-2419:
---

was (Author: JIRAUSER285361):
Max's alternative does cause a regression in some of the new tests, assuming I 
applied it correctly, which is surprising. I can take a longer look, but we may 
want to merge what we have first, since it *is* an improvement over the previous code.

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> ----------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Assignee: Joe Kesselman
>            Priority: Major
>         Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>     else if (m_encodingInfo.isInEncoding(ch)) {
>         // If the character is in the encoding, and
>         // not in the normal ASCII range, we also
>         // just leave it get added on to the clean characters
>
>     }
>     else {
>         // This is a fallback plan, we should never get here
>         // but if the character wasn't previously handled
>         // (i.e. isn't in the encoding, etc.) then what
>         // should we do?  We choose to write out an entity
>         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> 
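
For illustration, the symptom named in the issue title can be reproduced outside
Xalan. The following is a minimal sketch, not the ToStream code and not the
attached fix: the class and method names (AstralNcrSketch, perCharNcr,
perCodePointNcr) are hypothetical. It contrasts escaping each UTF-16 code unit
on its own, which yields two NCRs carrying surrogate values, with escaping per
Unicode code point, which yields a single NCR for an astral character.

```java
// Hypothetical illustration only; these names do not exist in Xalan.
final class AstralNcrSketch {

    /** Buggy approach: one NCR per UTF-16 code unit, so a surrogate pair becomes two NCRs. */
    public static String perCharNcr(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch < 0x80) {
                out.append(ch);
            } else {
                out.append("&#").append((int) ch).append(';');
            }
        }
        return out.toString();
    }

    /** Code-point-aware approach: a well-formed surrogate pair becomes one NCR. */
    public static String perCodePointNcr(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // codePointAt combines a high/low surrogate pair into one scalar value.
            // An unpaired surrogate is returned as-is; a real serializer would
            // need to decide how to report that case.
            int cp = text.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            // Advance by 2 code units for astral characters, otherwise by 1.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

For U+1F600, stored in Java as "\uD83D\uDE00", perCharNcr returns
"&#55357;&#56832;" while perCodePointNcr returns "&#128512;". When the output
encoding is UTF-8 the character can also be written directly, since UTF-8 can
represent every code point; the single NCR matters for encodings that cannot.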

[jira] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2024-01-20 Thread Joseph Kessselman (Jira)


[ https://issues.apache.org/jira/browse/XALANJ-2419 ]


Joseph Kessselman deleted comment on XALANJ-2419:
---

was (Author: jkesselm):
(apitest rather than smoketest, but it's there.)

Seeing a few oddities in astrals. Thought I had that running. Investigating.

I still need to look at @max's ToStream buffer-bounds tweak and see if that 
still applies. And at whether it ought to be replicated in 
ToXMLStream/ToHTMLStream to replace their handling of the surrogate-pair case; 
arguably so...?
