Re: Two questions - BOM in UTF-8, and manually cleaning XML

Rodrigo Ruiz Wed, 05 Jul 2006 08:30:49 -0700

Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier. It seems like a demo example for a servlet filter ;-)
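
For illustration, the prolog rewrite itself (not the servlet filter around it) might look roughly like the Java sketch below. It assumes the whole response body is already in memory as a byte array and really is UTF-16 with a BOM; PrologRewriter and declareUtf16 are made-up names, not Axis or servlet API.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class PrologRewriter {
    // Matches the encoding pseudo-attribute inside a leading XML declaration.
    private static final Pattern ENCODING_DECL = Pattern.compile(
            "^(<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"'])[^\"']*([\"'])");

    private PrologRewriter() {}

    /** Assumes the input is UTF-16 with a BOM whose prolog wrongly says utf-8. */
    static byte[] declareUtf16(byte[] utf16Body) {
        // Java's UTF-16 decoder honours a leading BOM and drops it from the text.
        String xml = new String(utf16Body, StandardCharsets.UTF_16);
        String fixed = ENCODING_DECL.matcher(xml).replaceFirst("$1UTF-16$2");
        // Java's UTF-16 encoder writes a big-endian BOM followed by big-endian text.
        return fixed.getBytes(StandardCharsets.UTF_16);
    }
}
```

If the document has no XML declaration at all, the pattern does not match and the text is only re-encoded.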


Hope this helps,
Rodrigo



Manuel Mall wrote:

On Wednesday 05 July 2006 23:12, Matthew Brown wrote:

Two bytes per char; Etherpeak is showing the second byte as 00.

Seems you are stuck between a "rock and a hard place" here. The byte stream appears to be correctly utf-16 encoded but the xml prolog says utf-8. Not sure what to recommend. Fixing it at the source is the obvious answer but not easily done. You may be able to write a handler that re-encodes the byte stream into utf-8 before giving it to the Axis stack. But how to write such an Axis handler and how to hook it correctly into the Axis processing chain is outside my area of expertise.


Maybe someone else can give advice on how to attempt such a thing.
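
The re-encoding step on its own is small. The Java sketch below covers only that step and says nothing about how to hook it into the Axis processing chain; Utf16ToUtf8 and transcode are hypothetical names, and the input is assumed to be a complete UTF-16 message held in memory.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Axis. Covers the re-encoding step only.
final class Utf16ToUtf8 {
    private Utf16ToUtf8() {}

    /** Assumes the input is UTF-16 (BOM optional; big-endian is assumed without one). */
    static byte[] transcode(byte[] utf16Bytes) {
        // The UTF-16 decoder reads the BOM to pick the byte order and removes it.
        String text = new String(utf16Bytes, StandardCharsets.UTF_16);
        // Java's UTF-8 encoder never writes a BOM, so the result matches a
        // prolog that declares encoding="utf-8".
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

Because the prolog already claims utf-8, it needs no change once the bytes themselves are UTF-8.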

Manuel

-----Original Message-----
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 11:09 AM
To: [email protected]
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML

On Wednesday 05 July 2006 23:04, Matthew Brown wrote:

Manuel,

I believe you hit the problem on the head - the response prolog
says utf-8 but (according to Etherpeak) the BOM is ff/fe.
Coincidentally, by the time the response XML gets logged by axis,
these initial characters are logged as ef bf bd ef bf bd.

Matt,

what about the rest of the byte stream when you look at it in
Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
(1 byte per char for all typical ascii characters)?

Manuel

Unfortunately we may be in a bit of a tough place with having the
producer of the XML change it; the customer whose web services we
are consuming doesn't seem to see any issue with this (as they are
fine with their .NET tools).

If it is the case where we are seeing a UTF-16 BOM but a prolog
that declares UTF-8; is there any way to instruct Axis/Xerces to
parse it as UTF-16? Sorry if this question doesn't make much sense,
but I'm not too familiar with how Axis and/or Xerces decide which
character encoding to use when reading the XML.
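
One way to take the decision away from the parser, outside of Axis, is to hand it characters instead of bytes: SAX documents that when an InputSource carries a character stream, the parser reads it directly and disregards the encoding declaration. The Java sketch below shows that with the plain JAXP API; ForcedUtf16Parser is a made-up name, and whether Axis lets you substitute the InputSource it uses is not something this shows.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper using plain JAXP; it does not touch Axis itself.
final class ForcedUtf16Parser {
    private ForcedUtf16Parser() {}

    static Document parse(InputStream rawBytes)
            throws ParserConfigurationException, SAXException, IOException {
        // Decode the bytes ourselves; the UTF-16 decoder consumes the BOM.
        Reader chars = new InputStreamReader(rawBytes, StandardCharsets.UTF_16);
        // With a character stream set, the parser does not re-decode the input,
        // so the mismatched encoding="utf-8" declaration is not acted upon.
        InputSource source = new InputSource(chars);
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
    }
}
```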

Thanks again
Matt

-----Original Message-----
From: Manuel Mall [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 10:58 AM
To: [email protected]
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning
XML

On Wednesday 05 July 2006 22:16, Axel Bock wrote:

Yes, there is a work-around. It works if you encode the file with
UTF-8 (for example), and do not include the BOM at the beginning.
I use notepad++ for that task, where you can save in "UTF-8
without BOM".

The process for that is easy:
1. open the file in notepad++
2. mark everything via CTRL-A
3. cut (not copy!)
4. in the format menu, choose "ANSI" formatting and select "UTF-8
without BOM" at the bottom
5. paste
6. save.

that is a crap workaround, but works for me. for automatically
generated files ..... I dunno :-)


Greetings,
Axel.


On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:

Hi all,

I hate to do this, but can anyone please help me with either of
these issues? I've tried to upgrade Xerces to 2.8.0 but to no
avail.

Is there anything else I could be doing?

Just wondering if your file in question starts with hex 'ef bb bf'
or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
believe you have a utf-16 encoded file (little endian or big
endian) not utf-8. If it is the 'ef bb bf' sequence then it starts
correctly with the utf-8 encoded unicode code point for BOM U+FEFF.
In all cases xerces should be able to handle it. A problem may
arise if it starts with 'ff fe' but the XML prolog says
encoding="utf-8" as that is a contradiction I believe.
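
A quick way to run this check in code is to look at the first few bytes of the response. The byte-order marks are EF BB BF for UTF-8, FF FE for UTF-16 little endian and FE FF for UTF-16 big endian. The Java sketch below is only an illustration (BomSniffer is a made-up name) and deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```java
import java.util.Optional;

// Hypothetical helper; the names are illustrative, not Axis or Xerces API.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the charset implied by a leading BOM, or empty if there is none. */
    static Optional<String> detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values; Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Comparing the detected charset with the encoding named in the prolog shows whether the producer is contradicting itself.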

I know this does not help directly but may help to check if the
problem is with the producer of the XML document or your consumer.

Manuel

What about the possibility of programmatically editing/cleaning
the response XML before it is given to the parser?

Thanks
Matt

-----Original Message-----
From: Matthew Brown [mailto:[EMAIL PROTECTED]]
Sent: Saturday, July 01, 2006 12:41 PM
To: [email protected]
Subject: Two questions - BOM in UTF-8, and manually cleaning XML


1. From searching the mailing list archives, I see several
references to people having problems with Byte Order Mark
characters appearing before the prolog in their UTF-8 messages.
However I can't seem to find much of a known resolution to these
issues. Is there a standard/common workaround for these BOM and
UTF-8 issues?

2. If there is no answer to my #1, is there any way that Axis will
allow me to programmatically edit the response XML before it is
passed to the parser and de-serialized? I've tried adding
Handlers, but I'm assuming that the Handler comes into the
picture after the message is parsed, because my Handler is only
ever seeing the request message, and not the response.
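
For the plain UTF-8 case in question 1, the cleaning step itself can be as small as wrapping the response stream so that a leading EF BB BF is skipped. The Java sketch below shows only that step; Utf8BomStripper is a made-up name, and where Axis would let such a wrapper be installed ahead of the parser is exactly the open question here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper; it does not show how to install the wrapper in Axis.
final class Utf8BomStripper {
    private Utf8BomStripper() {}

    /** Returns a stream positioned just after a leading UTF-8 BOM, if one exists. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 or EOF.
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // Anything that was not a BOM goes back so the caller sees it unchanged.
        if (!hasBom && n > 0) pushback.unread(head, 0, n);
        return pushback;
    }
}
```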

Thanks
Matt Brown

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
-------------------------------------------------------------------
GRIDSYSTEMS                    Rodrigo Ruiz Aguayo
Parc Bit - Son Espanyol
07120 Palma de Mallorca        mailto:[EMAIL PROTECTED]
Baleares - España              Tel:+34-971435085 Fax:+34-971435082
http://www.gridsystems.com
-------------------------------------------------------------------



