[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

Joe Kesselman (Jira) Tue, 23 Jan 2024 14:13:03 -0800


    [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810117#comment-17810117
 ]


Joe Kesselman commented on XALANJ-2419:
---------------------------------------

[~cdamioli]: I believe that was due to some mistakes in how the release was 
handled, and the awkward juggling needed to correct those mistakes.

*Master* is always supposed to be our primary development branch. New code may 
be developed on other branches but isn't official until it is merged into 
*Master*.

When a release is made, a tag or fork is created for that release number. Thus, 
there should be branches/tags for *2.7.1*, *2.7.2*, and *2.7.3* (along 
with older checkpoints).

If hot fixes are needed which must be applied to code that has already been 
released (rather than just being included in the next release), we may create 
*maint* branches where the change is back-ported to the earlier versions. 
Essentially, *2.7.1.maint* is the "development master" for *2.7.1.1*. This does 
_not_ mean *Master* should be derived from *maint* branches. It does mean that 
if something is fixed in an old release, *Master* should also be fixed. But due 
to code evolution over time, the fix may not be identical, and since *maint* is 
not one of *Master*'s dependencies, that port must be done manually.

I believe that what I've just described is standard SCCS "best practice". It's 
certainly how we managed Xalan (mumble) years ago before I dropped out of it.

 

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> ---------------------------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Assignee: Joe Kesselman
>            Priority: Major
>             Fix For: The Latest Development Code
>
>         Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>                     else if (m_encodingInfo.isInEncoding(ch)) {
>                         // If the character is in the encoding, and
>                         // not in the normal ASCII range, we also
>                         // just leave it get added on to the clean characters
>                         
>                     }
>                     else {
>                         // This is a fallback plan, we should never get here
>                         // but if the character wasn't previously handled
>                         // (i.e. isn't in the encoding, etc.) then what
>                         // should we do?  We choose to write out an entity
>                         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>                         writer.write("&#");
>                         writer.write(Integer.toString(ch));
>                         writer.write(';');
>                         lastDirtyCharProcessed = i;
>                     }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.
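
To illustrate the quoted report, here is a minimal sketch contrasting escaping 
per UTF-16 code unit (the symptom described above) with escaping per code 
point. It is not the Xalan code or the attached patch: the class and method 
names are hypothetical, java.nio's CharsetEncoder.canEncode() stands in for 
m_encodingInfo.isInEncoding(), and escaping of markup characters such as '<' 
and '&' is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of org.apache.xml.serializer.
class SurrogateNcrSketch {

    // Mimics the reported symptom: each UTF-16 code unit is tested and
    // escaped on its own, so a surrogate pair becomes two NCRs.
    public static String perCodeUnit(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                // For a surrogate this emits an NCR that XML does not allow.
                out.append("&#").append((int) ch).append(';');
            }
        }
        return out.toString();
    }

    // Works on whole code points: a surrogate pair is either written as-is
    // (when the encoding can represent it) or escaped as a single NCR.
    public static String perCodePoint(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int units = Character.charCount(cp); // 2 for astral characters
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                // An unpaired surrogate cannot be represented in XML at all.
                throw new IllegalArgumentException(
                        "Unpaired surrogate at index " + i);
            }
            CharSequence piece = text.subSequence(i, i + units);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += units;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String astral = "a\uD83D\uDE00"; // 'a' followed by U+1F600

        // Two NCRs carrying the surrogate values: a&#55357;&#56832;
        System.out.println(perCodeUnit(astral, StandardCharsets.UTF_8));

        // UTF-8 can encode the pair, so it is passed through unescaped.
        System.out.println(perCodePoint(astral, StandardCharsets.UTF_8)
                .equals(astral));

        // US-ASCII cannot, so one NCR for the code point: a&#128512;
        System.out.println(perCodePoint(astral, StandardCharsets.US_ASCII));
    }
}
```

The per-code-point variant rejects unpaired surrogates outright; a real 
serializer might prefer to report an error through its own channels instead.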



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org
For additional commands, e-mail: dev-h...@xalan.apache.org
