Hi Hen, I have more questions than answers...
On Tue, Jul 19, 2011 at 12:35 PM, Henri Yandell <[email protected]> wrote:
>
> So you're not saying that we have to escape > 0x7f (old behaviour),

Yeah, the way I read the W3C site, I thought we'd need to escape code points > 65,535 (above the BMP).

> but that we have to escape any supplementary characters?

Yes. In particular, an escaped code point > 65,535 must be written as a single escape (&#x233B4; rather than &#xD84C;&#xDFB4;).

The way I read the site, IF you are going to escape a code point > 65,535, then you MUST use a single code point value.

What is not clear to me yet is if/when you must escape code points > 65,535.

The XML 1.0 spec reads:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

So does that mean that we should make sure we do NOT escape an XML Char (aside from & > < and so on)?

Then what about XML 1.1? The XML 1.1 spec reads:

[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

The more I look at this, the more confusing it is!

Gary

>
> Hen
>
> On Tue, Jul 19, 2011 at 7:28 AM, Gary Gregory <[email protected]> wrote:
> > Hi All:
> >
> > I am glad to know there is a 3.0 way of doing that, which is:
> >
> > @Test
> > public void testEscapeXmlSupplementaryCharacters() {
> >     CharSequenceTranslator escapeXml =
> >         StringEscapeUtils.ESCAPE_XML.with( NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE) );
> >
> >     assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
> >         escapeXml.translate("\uD84C\uDFB4"));
> >
> > but what about the test the way it was originally written?
> >
> > // Example from https://issues.apache.org/jira/browse/LANG-728
> > assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
> >     StringEscapeUtils.escapeXml("\uD84C\uDFB4"));
> > // Example from See http://www.w3.org/International/questions/qa-escapes
> > assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
> >     StringEscapeUtils.escapeXml("\uD84C;\uDFB4;"));
> >
> > It still fails.
> >
> > Shouldn't the API be changed to work for this case too? The W3C seems to say so: "you must use the single, code point value for that character" in:
> >
> >  * From http://www.w3.org/International/questions/qa-escapes
> >  * </p>
> >  * <blockquote>
> >  * Supplementary characters are those Unicode characters that have code points higher than the characters in
> >  * the Basic Multilingual Plane (BMP). In UTF-16 a supplementary character is encoded using two 16-bit surrogate code points from the
> >  * BMP. Because of this, some people think that supplementary characters need to be represented using two escapes, but this is incorrect
> >  * – you must use the single, code point value for that character. For example, use &#x233B4; rather than &#xD84C;&#xDFB4;.
> >  * </blockquote>
> >
> > Gary
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: Tuesday, July 19, 2011 0:58 AM
> > To: [email protected]
> > Subject: svn commit: r1148162 - /commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java
> >
> > Author: bayard
> > Date: Tue Jul 19 04:58:03 2011
> > New Revision: 1148162
> >
> > URL: http://svn.apache.org/viewvc?rev=1148162&view=rev
> > Log:
> > Updating unit test for LANG-728 to work with Lang 3.0 way of using escapeXml with > 0x7f characters
> >
> > Modified:
> >     commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java
> >
> > Modified: commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java
> > URL: http://svn.apache.org/viewvc/commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java?rev=1148162&r1=1148161&r2=1148162&view=diff
> > ==============================================================================
> > --- commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java (original)
> > +++ commons/proper/lang/trunk/src/test/java/org/apache/commons/lang3/StringEscapeUtilsTest.java Tue Jul 19 04:58:03 2011
> > @@ -31,6 +31,9 @@ import org.apache.commons.io.IOUtils;
> >  import org.junit.Ignore;
> >  import org.junit.Test;
> >
> > +import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
> > +import org.apache.commons.lang3.text.translate.UnicodeEscaper;
> > +
> >  /**
> >   * Unit tests for {@link StringEscapeUtils}.
> >   *
> > @@ -333,15 +336,13 @@ public class StringEscapeUtilsTest {
> >       * @see <a href="http://www.w3.org/International/questions/qa-escapes">Using character escapes in markup and CSS</a>
> >       * @see <a href="https://issues.apache.org/jira/browse/LANG-728">LANG-728</a>
> >       */
> > -    @Ignore
> >      @Test
> >      public void testEscapeXmlSupplementaryCharacters() {
> > -        // Example from https://issues.apache.org/jira/browse/LANG-728
> > -        assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
> > -                StringEscapeUtils.escapeXml("\uD84C\uDFB4"));
> > -        // Example from See http://www.w3.org/International/questions/qa-escapes
> > -        assertEquals("Supplementary character must be represented using a single escape", "&#144308;",
> > -                StringEscapeUtils.escapeXml("\uD84C;\uDFB4;"));
> > +        CharSequenceTranslator escapeXml =
> > +            StringEscapeUtils.ESCAPE_XML.with( UnicodeEscaper.between(0x7f, Integer.MAX_VALUE) );
> > +
> > +        assertEquals("Supplementary character must be represented using a single escape", "\u233B4",
> > +                escapeXml.translate("\uD84C\uDFB4"));
> >      }
> >
> >      // Tests issue #38569
> >
> >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

--
Thank you,
Gary

http://garygregory.wordpress.com/
http://garygregory.com/
http://people.apache.org/~ggregory/
http://twitter.com/GaryGregory
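
For reference, the two points debated in this thread can be sketched in plain Java without any Commons Lang dependency. This is only a sketch under stated assumptions: the class name SupplementaryEscapeSketch and both method names are hypothetical, and the code is not how StringEscapeUtils is implemented. It iterates by code point rather than by char, so a surrogate pair comes out as one numeric reference, and it checks a code point against the XML 1.0 Char production quoted in this message. The 0x7f threshold mirrors the between(0x7f, Integer.MAX_VALUE) range used in the test.

```java
public final class SupplementaryEscapeSketch {

    private SupplementaryEscapeSketch() {
    }

    /**
     * Returns true if the code point matches the XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
     */
    public static boolean isXml10Char(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /**
     * Replaces every code point at or above 0x7f with one decimal numeric
     * character reference. Markup characters such as & and < are left alone;
     * this sketch only illustrates the single-escape rule.
     * Assumption: unpaired surrogates are not treated specially.
     */
    public static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // codePointAt combines a valid surrogate pair into one code point
            int codePoint = input.codePointAt(i);
            if (codePoint >= 0x7f) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.appendCodePoint(codePoint);
            }
            // advance by 1 char for the BMP, 2 chars for a surrogate pair
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("\uD84C\uDFB4")); // prints &#144308;
        System.out.println(isXml10Char(0x233B4));           // prints true
        System.out.println(isXml10Char(0xD84C));            // prints false
    }
}
```

Whether such characters must be escaped at all is the question this thread leaves open; the sketch only shows that, once a supplementary character is escaped, working on code points yields a single reference instead of two surrogate escapes.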
