I added an input stream wrapper to the CVS code that detects and handles the character encoding of an input document supplied as an input stream. It turns out that parser support for detecting and handling character encodings is optional with XMLPull, which I hadn't realized before. The approach I implemented handles this independently of the parser.
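
As a rough illustration of what such a wrapper might look like (this is not the code committed to CVS, and the class and method names here are hypothetical), the sketch below peeks at the start of the stream, looks for an encoding in the XML declaration, and falls back to UTF-8. It assumes an ASCII-compatible encoding and does not handle byte order marks or UTF-16.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of JiBX or XMLPull.
public class EncodingSniffer {
    // How many leading bytes to inspect for an XML declaration.
    private static final int PEEK_LIMIT = 256;

    // Matches encoding="..." inside a leading <?xml ... ?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Returns the declared encoding, or "UTF-8" if none is declared. */
    public static String detect(byte[] head, int length) {
        // ISO-8859-1 maps every byte to one char, so this never fails
        // and leaves the ASCII declaration readable.
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        return m.find() ? m.group(1) : "UTF-8";
    }

    /** Wraps the stream in a Reader that uses the detected encoding. */
    public static Reader open(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in, PEEK_LIMIT);
        buffered.mark(PEEK_LIMIT);
        byte[] head = new byte[PEEK_LIMIT];
        int length = 0;
        while (length < PEEK_LIMIT) {
            int n = buffered.read(head, length, PEEK_LIMIT - length);
            if (n < 0) {
                break; // end of stream before the limit
            }
            length += n;
        }
        buffered.reset(); // rewind so the parser sees the whole document
        return new InputStreamReader(buffered, detect(head, length));
    }
}
```

The resulting Reader can then be handed to any parser that accepts character input, so the parser's own encoding support no longer matters.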

I also confirmed and fixed one error in the UTF-8 encoding, which affects character codes in the 0x800-0x3FFF range.
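
For reference, code points from 0x800 through 0xFFFF take three bytes in UTF-8, so the affected range falls entirely in the three-byte form. The sketch below shows the standard bit layout; it is a generic illustration with a hypothetical class name, not the JiBX code, and it does not reject surrogate code points (0xD800-0xDFFF).

```java
// Hypothetical class, for illustration only; not the JiBX encoder.
public class Utf8Sketch {
    /** Encodes a single Unicode code point (0 to 0x10FFFF) as UTF-8 bytes. */
    public static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }
}
```

For example, U+20AC (the euro sign) encodes as the bytes E2 82 AC.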

I'm hoping to avoid yet another release in the beta 3 series, so I'll probably just refer users to CVS if they need this support prior to beta 4.

 - Dennis

Dennis Sosnoski wrote:

Actually, I thought I'd noticed an error in the ISO-8859-1 code but on further examination it looks good (and works okay in my tests, too).

What *does* appear to be a problem is if you don't specify an encoding for an input stream that starts with an XML declaration specifying UTF-8 (<?xml version="1.0" encoding="UTF-8"?>). It looks like the parser is not correctly interpreting the input in this case. I'm investigating further, but thought I'd let people know the story.

If you know you're going to be working with UTF-8 documents, a workaround for now is to just specify the encoding when you set the input stream. That appears to work properly.

 - Dennis

HD wrote:

OK, I added the bug to Jira with a simple JUnit test case. The UTF-8 encoding fails with accents. I'm glad you found the ISO issue, because I can't reproduce it :-(

Henri.

HD 1meyrxd02-at-sneakemail.com |JiBX| wrote:

With UTF-8, it seems that when the XML file is read, the encoding is not taken into account, and the UTF-8 multi-byte sequences are not translated back into characters...
So I don't get the same bug as with ISO-8859-1, but accents are not translated back into accents.


Henri.

Dennis Sosnoski dms-at-sosnoski.com |JiBX| wrote:

Actually, the problem I noticed is only for ISO-8859-1 - do you also see a problem when using UTF-8?

 - Dennis

Dennis Sosnoski wrote:

I see that there's an error in the encoding handling that I'd missed. Most of the test cases use only ASCII characters, though I thought I had a few that went outside that set. I'll get it fixed in CVS as soon as I can, and will also add it to the test suite. If you can put together a simple code example for this and attach it to a Jira issue, I'll make sure it works properly for your data. Thanks,

 - Dennis

HD wrote:



I tried to use the UTF-8 and ISO-8859-1 encodings, but something strange seems to be happening with the output encoding: all the French accents come out as strange characters.
For instance: rte st Antoine de Ginestière becomes
<CT_Adresse>rte Saint Antoine de GinestiÃ&#x0192;Æ&#x2019;Ã&#x2020;@&#x2122;Ã&#x0192;@ @D¢Ã&#x0192;Æ&#x2019;@Å¡Ã&#x0192;@&#x0161;Ã&#x201K;¨re</CT_Adresse>


This particular string was encoded with ISO-8859-1. But I get these strange characters too in UTF-8. I'm wondering what encoding was used to compile JiBX ?
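
Output like this is typical of text being encoded as UTF-8 and then decoded with a single-byte encoding, possibly more than once. The sketch below reproduces that general effect with the standard library only; it assumes nothing about where the mismatch occurs in JiBX, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; shows the general mechanism, not the JiBX bug itself.
public class MojibakeDemo {
    /** Encodes text as UTF-8, then wrongly decodes the bytes as ISO-8859-1. */
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "Ginesti\u00e8re"; // e with grave accent, written as an escape
        String once = misdecode(original);   // the accent becomes two characters
        String twice = misdecode(once);      // each of those becomes two more
        // Print lengths rather than the text, to avoid console encoding issues.
        System.out.println(original.length() + " " + once.length() + " " + twice.length());
    }
}
```

Each round trip through the wrong decoder multiplies the characters that stand in for one accented letter, which is why repeated mismatches produce long runs of junk.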

Henri.



_______________________________________________
jibx-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/jibx-users