Here is an excerpt from the XML 1.1 spec (http://www.w3.org/TR/xml11/):

--
The ampersand character (&) and the left angle bracket (<) MUST NOT appear in 
their literal form, except when used as markup delimiters, or within a comment, 
a processing instruction, or a CDATA section. If they are needed elsewhere, 
they MUST be escaped using either numeric character references or the strings 
"&amp;" and "&lt;" respectively. The right angle bracket (>) MAY be represented 
using the string "&gt;", and MUST, for compatibility, be escaped using either 
"&gt;" or a character reference when it appears in the string "]]>" in content, 
when that string is not marking the end of a CDATA section.

In the content of elements, character data is any string of characters which 
does not contain the start-delimiter of any markup or the CDATA-section-close 
delimiter, "]]>". In a CDATA section, character data is any string of 
characters not including the CDATA-section-close delimiter.

To allow attribute values to contain both single and double quotes, the 
apostrophe or single-quote character (') MAY be represented as "&apos;", and 
the double-quote character (") as "&quot;".
--

Here is how I read this for our use:

The escapeXml method (IMO) is meant to produce the contents of XML elements and 
attributes. In order to produce valid XML content for an attribute or an 
element, the & and < characters must be escaped. For compatibility, the > 
character must also be escaped when it is part of "]]>". The tricky part is 
what to do with single and double quote characters. When the content is for an 
XML element, you do not need to do anything. When the content is for an XML 
attribute, you need to know whether the attribute is delimited with single or 
double quotes in order to escape only what is needed. I would not want to 
produce an overly escaped string.

So, going by this excerpt, "low" or "high" characters should not be escaped.

All of this leads me to think that we should deprecate escapeXml and create: 
escapeXmlElementContent(String) and escapeXmlAttributeContents(String, char) 
where the char denotes which quote character to escape.
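
As a rough sketch of what I have in mind (nothing here is existing Commons Lang 
code; the class name XmlEscapeSketch and the null handling are assumptions made 
for illustration, and only the rules from the excerpt above are applied):

```java
public final class XmlEscapeSketch {

    private XmlEscapeSketch() {
    }

    /**
     * Escapes text for use as element content: '&' and '<' always,
     * and '>' only when it ends the sequence "]]>".
     * Quote characters are left alone.
     */
    public static String escapeXmlElementContent(String s) {
        if (s == null) {
            return null; // assumption: null in, null out
        }
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '&') {
                out.append("&amp;");
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == '>' && i >= 2
                    && s.charAt(i - 1) == ']' && s.charAt(i - 2) == ']') {
                out.append("&gt;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Escapes text for use inside an attribute value delimited by the
     * given quote character: '&', '<', and only that quote character.
     */
    public static String escapeXmlAttributeContents(String s, char quote) {
        if (quote != '"' && quote != '\'') {
            throw new IllegalArgumentException("quote must be ' or \"");
        }
        if (s == null) {
            return null; // assumption: null in, null out
        }
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '&') {
                out.append("&amp;");
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == quote) {
                out.append(quote == '"' ? "&quot;" : "&apos;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this, escapeXmlAttributeContents("it's \"ok\"", '"') leaves the apostrophe 
alone and escapes only the double quotes. The sketch does not try to settle the 
low/high character question, and it ignores attribute-value whitespace 
normalization.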

Gary

> -----Original Message-----
> From: Henri Yandell [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 18, 2006 9:55 AM
> To: Jakarta Commons Users List
> Subject: Re: [Lang] escapeXML() -> Not escaping low characters
> 
> On 3/31/06, David López Muñoz <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > I'm trying to escape some text to make it XML-valid and I'm using
> > StringEscapeUtils.escapeXml().
> >
> > I found a problem with low characters such as #18. They don't seem to be
> > escaped, and therefore they are mixed together with other characters as if
> > they were normal characters such as 'a', '1' etc.
> >
> > Am I doing something wrong? I'm using commons-lang 2.1. Is it a known bug
> > already solved in newer versions?
> 
> Sorry for lack of reply. Definitely not fixed yet, and thanks for
> reporting it in bugzilla. There's another bug that complains that high
> characters ARE getting escaped - so definitely something that's up for
> debate :)
> 
> Would all low-chars want to be escaped? I suspect that people wouldn't
> want newlines suddenly being escaped and turning the xml into a single
> line. Anyone got any idea if the XML spec even talks about low-chars?
> 
> Hen
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

