Re: XmlPullParser parses strings with platform's default charset

Martin Grigorov Wed, 06 Jun 2012 03:33:23 -0700

Hi Juergen,

Thanks for the explanation!


I've tried all combinations of the following variables:
- -Dfile.encoding=latin1
- with and without <?xml encoding="utf-8"?> in the String to parse
- parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), null);
- parse(new ByteArrayInputStream(string.toString().getBytes()), "UTF-8");
- parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), "UTF-8");

and the test passes only when the String has the prolog with the
encoding and
"parse(new ByteArrayInputStream(string.toString().getBytes("UTF-8")), "UTF-8");"
is used. Any other combination produces mangled characters and the
assertion fails.
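
To illustrate why the combinations above disagree, here is a minimal sketch that uses only the JDK, not Wicket. The class and method names (CharsetMismatchDemo, roundTrip) are made up for this example. It encodes the test string with one charset and decodes the bytes with another, which is roughly what happens when getBytes() and the reader assume different encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: encode with one charset, decode with another.
    // This mimics a producer and a consumer that disagree about the
    // encoding of the same bytes.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file itself encoding-independent.
        String expected = "\u00e4\u00f6\u00fc\u20ac"; // a-umlaut, o-umlaut, u-umlaut, euro sign

        String same = roundTrip(expected, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = roundTrip(expected, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(expected.equals(same));  // prints true
        System.out.println(expected.equals(mixed)); // prints false: each UTF-8 byte became one char
    }
}
```

The characters survive only when both sides use the same charset, which matches the one passing combination in the list.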

So I cannot find a stable solution that will work in any environment.
We can use IRequestCycleSettings#getResponseRequestEncoding() for the
charset, but if there is no XML prolog, or it has no encoding
attribute, then the test fails.
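
One possible workaround, sketched here with the JDK only: add an XML declaration when the string has none, and encode the bytes with the same charset that the declaration names. Everything in this block (ExplicitEncodingInput, withDeclaration, toBytes, toStream) is hypothetical and not part of Wicket. Whether XmlReader accepts this exact declaration is an assumption I have not verified.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitEncodingInput {

    // Hypothetical helper: prepend an XML declaration naming the charset,
    // unless the markup already starts with one.
    // Limitation: an existing declaration WITHOUT an encoding attribute is
    // left untouched, so that case is not handled by this sketch.
    static String withDeclaration(CharSequence markup, Charset charset) {
        String text = markup.toString();
        if (text.startsWith("<?xml")) {
            return text;
        }
        return "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>" + text;
    }

    // Encode with the same charset that the declaration announces.
    static byte[] toBytes(CharSequence markup, Charset charset) {
        return withDeclaration(markup, charset).getBytes(charset);
    }

    // Wrap the bytes so they can be handed to a stream-based parse method.
    static InputStream toStream(CharSequence markup, Charset charset) {
        return new ByteArrayInputStream(toBytes(markup, charset));
    }

    public static void main(String[] args) {
        Charset charset = StandardCharsets.UTF_8;
        String markup = "<dummy>\u00e4\u00f6\u00fc\u20ac</dummy>";

        byte[] bytes = toBytes(markup, charset);
        // Decoding with the declared charset restores the original text.
        System.out.println(new String(bytes, charset));
    }
}
```

The stream from toStream(...) would then be passed together with the charset name, as in the parse(new ByteArrayInputStream(...), "UTF-8") call listed above.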

On Tue, Jun 5, 2012 at 11:53 PM, Juergen Donnerstag
<[email protected]> wrote:
> Hi Martin,
>
> XmlReader reads the markup file, interprets <?xml encoding ..> if
> present, and converts the markup into a String, which in Java is
> always UTF-16 encoded. XmlPullParser uses the data provided by XmlReader.
>
> To support unit testing, XPP provides a parse(String) method which
> wraps the string in an InputStream, so that tests do not circumvent
> XmlReader.
>
> No XML declaration (or no encoding) results in XmlReader using the JVM
> default, which is the OS default if not provided via -Dfile.encoding=
>
> And since you never know on which OS and in which country devs are
> building or testing, providing the UTF encoded value is the safe way
> of doing it.
>
> We may replace parse(string) with parse(string, "encoding"), which
> seems to be supported by all underlying methods, although they are
> preset with null (JVM default) right now. That may help you solve your
> problem, and make other devs aware that the encoding might need to change.
>
> make sense?
>
> Juergen
>
> On Tue, Jun 5, 2012 at 9:54 AM, Juergen Donnerstag
> <[email protected]> wrote:
>> I'll have a look later today.
>>
>> Juergen
>>
>> On Mon, Jun 4, 2012 at 3:37 PM, Martin Grigorov
>> <[email protected]> wrote:
>>> Hi,
>>>
>>> I'm not quite sure but I think there is a bug in
>>> org.apache.wicket.markup.parser.XmlPullParser#parse(CharSequence)
>>> because it uses
>>> string.toString().getBytes() to create a ByteArrayInputStream.
>>>
>>> org.apache.wicket.util.tester.BaseWicketTester#getTagById(String) uses
>>> lastResponseAsString to feed XmlPullParser but lastResponseAsString's
>>> encoding depends on
>>> org.apache.wicket.settings.IRequestCycleSettings#getResponseRequestEncoding().
>>> I.e. the string may be encoded in UTF-8 but later XmlPullParser will
>>> try to process its bytes as Windows-1252 for example.
>>>
>>>
>>> Here is a small patch that exposes the problem:
>>> diff --git a/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java b/wicket-core/src/test/java/org/apache/wicket/markup/p
>>> index 2e26d05..15fb496 100644
>>> --- a/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>>> +++ b/wicket-core/src/test/java/org/apache/wicket/markup/parser/XmlPullParserTest.java
>>> @@ -191,6 +191,13 @@ public class XmlPullParserTest extends Assert
>>>                assertNull(parser.getEncoding());
>>>                tag = parser.nextTag();
>>>                assertNull(tag);
>>> +
>>> +               String expected = "äöü€";
>>> +               parser.parse("<dummy>"+expected+"</dummy>");
>>> +               XmlTag openTag = parser.nextTag();
>>> +               XmlTag closeTag = parser.nextTag();
>>> +               String actual = parser.getInput(openTag.getPos() + openTag.getLength(), closeTag.getPos()).toString();
>>> +               assertEquals(expected, actual);
>>>        }
>>>
>>>        /**
>>>
>>> Apply this patch and run the test with -Dfile.encoding=latin1. It will
>>> fail in the comparison. Run it with UTF-8 and it will pass.
>>>
>>> I remember Juergen had a similar problem with one of the Wicket core
>>> tests that uses the Euro sign in an assertion, and he fixed it by
>>> using a Unicode-escaped value (\uabcd).
>>> But in this case the response is encoded with whatever is configured
>>> at IRequestCycleSettings#getResponseRequestEncoding() and
>>> XmlPullParser tries to read it with the platform default encoding.
>>>
>>> Is this a bug, and how can we solve it?
>>>
>>> --
>>> Martin Grigorov
>>> jWeekend
>>> Training, Consulting, Development
>>> http://jWeekend.com



-- 
Martin Grigorov
jWeekend
Training, Consulting, Development
http://jWeekend.com
