maxfortun commented on code in PR #166:
URL: https://github.com/apache/xalan-java/pull/166#discussion_r1465445384


##########
serializer/src/main/java/org/apache/xml/serializer/ToStream.java:
##########
@@ -1595,23 +1599,40 @@ else if (m_encodingInfo.isInEncoding(ch)) {
                         // not in the normal ASCII range, we also
                         // just leave it get added on to the clean characters
                     }
-                    else if (Encodings.isHighUTF16Surrogate(ch) && i < end-1 && Encodings.isLowUTF16Surrogate(chars[i+1])) {
-                       // So, this is a (valid) surrogate pair
-                       if (! m_encodingInfo.isInEncoding(ch, chars[i+1])) {
-                               int codepoint = Encodings.toCodePoint(ch, chars[i+1]);
-                               writeOutCleanChars(chars, i, lastDirtyCharProcessed);
-                               writer.write("&#");
-                               writer.write(Integer.toString(codepoint));
-                               writer.write(';');
-                               lastDirtyCharProcessed = i+1;
-                       }
-                       i++; // skip the low surrogate, too
+                    else if (Encodings.isHighUTF16Surrogate(ch)) {
+                        // Store for later processing. We may be at the end of a buffer,
+                                               // and must wait till low surrogate arrives
+                                               // before we can do anything with this.
+                        writeOutCleanChars(chars, i, lastDirtyCharProcessed);
+                        m_highUTF16Surrogate = ch;
+                        lastDirtyCharProcessed = i;
+                    }
+                    else if (m_highUTF16Surrogate != 0 && Encodings.isLowUTF16Surrogate(ch)) {
+                        // The complete utf16 byte sequence is now available and may be serialized.
+                       if (! m_encodingInfo.isInEncoding(m_highUTF16Surrogate, ch)) {

Review Comment:
   1602: If ch is a high surrogate, write out the clean chars and retain ch in 
   m_highUTF16Surrogate. This works whether the low surrogate arrives in the 
   same buffer or at the start of the next one.
   1610: If we have a retained high surrogate and ch is a valid low surrogate, 
   there are two cases:
       1. The encoding doesn't support the character: escape it as a numeric 
       character reference (&#...;).
       2. The encoding does support the character: output the chars as is.
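
To make the two cases concrete, here is a minimal standalone sketch of the same idea. It is not the ToStream code: the class name `SurrogateCarrySketch`, the field `pendingHigh` (standing in for m_highUTF16Surrogate), and the use of `CharsetEncoder.canEncode` in place of m_encodingInfo.isInEncoding are all assumptions made for illustration. It only handles the surrogate logic; escaping of other unencodable chars and the clean-char bookkeeping are left out, and an unpaired high surrogate is simply dropped.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of xalan-java.
final class SurrogateCarrySketch {
    private final CharsetEncoder encoder;
    // Retained high surrogate, 0 when none is pending.
    // Plays the role of m_highUTF16Surrogate in the PR.
    private char pendingHigh = 0;

    SurrogateCarrySketch(Charset charset) {
        this.encoder = charset.newEncoder();
    }

    // Processes one buffer; state survives between calls, so a pair
    // split across two buffers is still recognised.
    void write(char[] chars, StringBuilder out) {
        for (char ch : chars) {
            if (Character.isHighSurrogate(ch)) {
                // Cannot decide anything yet: wait for the low surrogate.
                pendingHigh = ch;
            } else if (pendingHigh != 0 && Character.isLowSurrogate(ch)) {
                String pair = new String(new char[] {pendingHigh, ch});
                if (encoder.canEncode(pair)) {
                    // Case 2: the encoding supports the character.
                    out.append(pair);
                } else {
                    // Case 1: escape as a decimal character reference.
                    out.append("&#")
                       .append(Character.toCodePoint(pendingHigh, ch))
                       .append(';');
                }
                pendingHigh = 0;
            } else {
                // Assumption of this sketch: an unpaired high surrogate
                // is discarded and the current char is copied through.
                pendingHigh = 0;
                out.append(ch);
            }
        }
    }

    public static void main(String[] args) {
        SurrogateCarrySketch ascii =
                new SurrogateCarrySketch(StandardCharsets.US_ASCII);
        StringBuilder out = new StringBuilder();
        // U+1F600 split across two buffers: high surrogate ends the first,
        // low surrogate starts the second.
        ascii.write(new char[] {'a', '\uD83D'}, out);
        ascii.write(new char[] {'\uDE00', 'b'}, out);
        System.out.println(out); // prints a&#128512;b
    }
}
```

With `StandardCharsets.UTF_8` instead of `US_ASCII`, the same two calls would emit the pair unchanged, which corresponds to case 2.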
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

