Davanum,

I had tried this previously, and the only effect I noticed was that the
encoding attribute in my request message's prolog changed. The response message
was still being parsed as UTF-8 (as the headers declared), although it was
actually UTF-16.
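
To illustrate why the response came out garbled when it was decoded as UTF-8, here is a minimal sketch. It assumes the response bytes were UTF-16LE preceded by an FF FE byte order mark; the class and method names are made up for illustration and are not part of Axis or Xerces. Each BOM byte is invalid in UTF-8 and is replaced with U+FFFD, whose UTF-8 form is ef bf bd, which would be consistent with the ef bf bd ef bf bd seen in the Axis log further down this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Axis or Xerces.
final class EncodingMismatchDemo {

    /** Encodes text as UTF-16LE preceded by the FF FE byte order mark (assumed wire format). */
    public static byte[] utf16LeWithBom(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_16LE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFF;
        out[1] = (byte) 0xFE;
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    /** Decodes the bytes as UTF-8, i.e. the charset the Content-Type header declared. */
    public static String decodeAsUtf8(byte[] bytes) {
        // Malformed input is replaced with U+FFFD rather than raising an error.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeAsUtf8(utf16LeWithBom("<?xml version=\"1.0\"?>"));
        // Code points of the first four characters after the mismatched decode.
        for (int i = 0; i < 4; i++) {
            System.out.println(Integer.toHexString(decoded.charAt(i)));
        }
        // The zero high bytes of UTF-16 survive as NUL characters between the
        // letters, so a plain search for the prolog fails on this string.
        System.out.println(decoded.indexOf("<?xml"));
    }
}
```

One consequence of the interleaved NUL characters is that a string search for "<?xml" would not find the prolog in the UTF-16 case, so a search-based cleanup only helps once the payload really is UTF-8.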

Anyway, the service provider has now changed their service to return true
UTF-8 data, but Xerces still has trouble interpreting the UTF-8 BOM before the
prolog, so I have found a very hack-ish solution: add a handler that removes
any characters preceding the XML prolog in the currentMessage if the
MessageContext is past the pivot. This doesn't feel like a great solution to
me (why isn't the XML parser prepared to handle the BOM? Is the wrong parse
method being used?), but it works for us for right now.

Thanks for the help all
Matt

---------

package com.viecore.ipl.ws;

import javax.xml.soap.SOAPMessage;

import org.apache.axis.AxisFault;
import org.apache.axis.Message;
import org.apache.axis.MessageContext;
import org.apache.axis.SOAPPart;
import org.apache.axis.handlers.BasicHandler;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

public class MyHandler extends BasicHandler {

    private static Logger log = LogManager.getLogger(MyHandler.class);

    public void invoke(MessageContext messageContext) throws AxisFault {

        try {
            if (log.isInfoEnabled()) log.info("invoke - start");
            log.info("invoke - past pivot [" + messageContext.getPastPivot() + "]");

            SOAPMessage rpcMsg = messageContext.getMessage();

            if (rpcMsg instanceof Message) {
                Message axisMsg = (Message) rpcMsg;

                if (log.isDebugEnabled())
                    log.debug("invoke - cast javax.xml.soap.SOAPMessage"
                            + " to org.apache.axis.Message");

                javax.xml.soap.SOAPPart rpcPart = axisMsg.getSOAPPart();
                if (rpcPart instanceof SOAPPart) {
                    SOAPPart axisPart = (SOAPPart) rpcPart;

                    if (log.isDebugEnabled())
                        log.debug("invoke - cast javax.xml.soap.SOAPPart"
                                + " to org.apache.axis.SOAPPart");

                    Object currentMessage = axisPart.getCurrentMessage();
                    if (currentMessage == null) {
                        log.debug("invoke - current message is null, cannot clean");
                    }
                    else {
                        if (log.isDebugEnabled())
                            log.debug("invoke - current message of SOAP part has type ["
                                    + currentMessage.getClass().getName()
                                    + "] content [" + currentMessage.toString() + "]");

                        // attempt to remove bad characters from the response
                        // (only the response side, i.e. past the pivot)
                        if (messageContext.getPastPivot()) {

                            if (currentMessage instanceof String) {
                                String strMessage = (String) currentMessage;
                                // everything before the XML prolog, such as the BOM, is dropped
                                int idx = strMessage.indexOf("<?xml");
                                if (idx == -1) {
                                    log.warn("invoke - Could not find xml prolog"
                                            + " in response message");
                                }
                                else {
                                    String cleaned = strMessage.substring(idx);

                                    log.debug("invoke - Setting SOAPPart.currentMessage to: "
                                            + cleaned);

                                    axisPart.setCurrentMessage(cleaned,
                                            axisPart.getCurrentForm());
                                }
                            }
                        }
                    }
                }
            }
            if (log.isInfoEnabled()) log.info("invoke - complete");
        }
        catch (Exception ex) {
            // log and continue so that a cleanup failure does not abort the call
            log.error("Caught exception in invoke()", ex);
        }
    }

}
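
A narrower variant of the cleanup step, as a minimal sketch: instead of discarding everything before "<?xml", it removes only a single leading U+FEFF character. It assumes the BOM reaches the handler as that one character, which would be the case if the EF BB BF bytes had been decoded as UTF-8; that is not confirmed anywhere in this thread. The class and method names are hypothetical.

```java
// Hypothetical helper, not part of Axis; shown only to illustrate the idea.
final class BomStripper {

    /** The byte order mark as it appears after correct UTF-8 or UTF-16 decoding. */
    private static final char BOM = '\uFEFF';

    private BomStripper() {
    }

    /**
     * Returns the text without a leading U+FEFF, or the text unchanged
     * (including null) if it does not start with one.
     */
    public static String stripLeadingBom(String text) {
        if (text != null && text.length() > 0 && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        String withBom = BOM + "<?xml version=\"1.0\" encoding=\"utf-8\"?><a/>";
        String cleaned = stripLeadingBom(withBom);
        System.out.println(cleaned.startsWith("<?xml"));
        System.out.println(withBom.length() - cleaned.length());
    }
}
```

In the handler above, this could stand in for the indexOf/substring pair. The trade-off is that stray characters other than the BOM would be left in place rather than silently dropped.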

-----Original Message-----
From: Davanum Srinivas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 05, 2006 3:41 PM
To: [email protected]
Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML


did you see my response on setting the CHARACTER_SET_ENCODING? what is
the exact stack trace you get on the client?

thanks,
dims

On 7/5/06, Matthew Brown <[EMAIL PROTECTED]> wrote:
> text/xml and utf-8, which I suppose explains the attempt to parse the UTF-16 
> message as UTF-8. The customer has changed the format of the message to 
> correctly be UTF-8 in actuality, although Xerces still isn't a fan of the 
> UTF-8 BOM (ef bb bf).
>
>
>
> -----Original Message-----
> From: Simon Fell [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 2:46 PM
> To: [email protected]
> Subject: RE: Two questions - BOM in UTF-8, and manually cleaning XML
>
>
> What does the content-type header say the charset is? That takes precedence 
> over the payload (at least for SOAP 1.1)
>
> Cheers
> Simon
>
> -----Original Message-----
> From: Rodrigo Ruiz [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 05, 2006 8:30 AM
> To: [email protected]
> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
>
> Maybe changing the xml prolog from "utf-8" to "utf-16" will be easier.
> It seems like a demo example for a servlet filter ;-)
>
>
> Hope this helps,
> Rodrigo
>
>
>
> Manuel Mall wrote:
> > On Wednesday 05 July 2006 23:12, Matthew Brown wrote:
> >> Two bytes per char; Etherpeak is showing the second byte as 00.
> >>
> > Seems you are stuck between a "rock and a hard place" here. The byte
> > stream appears to be correctly utf-16 encoded but the xml prolog says
> > utf-8. Not sure what to recommend. Fix it at the source is obvious but
> > not easily done. You may be able to write a handler that re-encodes
> > the byte stream into utf-8 before giving it to the Axis stacks. But
> > how to write such an Axis handler and how to hook it correctly into
> > the Axis processing chain is outside my area of expertise.
> >
> > Maybe someone else can give advice on how to attempt such a thing.
> >
> > Manuel
> >> -----Original Message-----
> >> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, July 05, 2006 11:09 AM
> >> To: [email protected]
> >> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
> >>
> >> On Wednesday 05 July 2006 23:04, Matthew Brown wrote:
> >>> Manuel,
> >>>
> >>> I believe you hit the problem on the head - the response prolog says
> >>> utf-8 but (according to Etherpeak) the BOM is ff/ef.
> >>> Coincidentally, by the time the response XML gets logged by axis,
> >>> these initial characters are logged as ef bf bd ef bf bd.
> >> Matt,
> >>
> >> what about the rest of the byte stream when you look at it in
> >> Etherpeak. Is it UTF-16 encoded (2 bytes per char) or UTF-8 encoded
> >> (1 byte per char for all typical ascii characters)?
> >>
> >> Manuel
> >>
> >>> Unfortunately we may be in a bit of a tough place with having the
> >>> producer of the XML change it; the customer whose web services we
> >>> are consuming doesn't seem to see any issue with this (as they are
> >>> fine with their .NET tools).
> >>>
> >>> If it is the case where we are seeing a UTF-16 BOM but a prolog that
> >>> declares UTF-8; is there any way to instruct Axis/Xerces to parse it
> >>> as UTF-16? Sorry if this question doesn't make much sense, but I'm
> >>> not too familiar with how Axis and/or Xerces decide which character
> >>> encoding to use when reading the XML.
> >>>
> >>> Thanks again
> >>> Matt
> >>>
> >>> -----Original Message-----
> >>> From: Manuel Mall [mailto:[EMAIL PROTECTED]
> >>> Sent: Wednesday, July 05, 2006 10:58 AM
> >>> To: [email protected]
> >>> Subject: Re: Two questions - BOM in UTF-8, and manually cleaning XML
> >>>
> >>> On Wednesday 05 July 2006 22:16, Axel Bock wrote:
> >>>> Yes, there is a work-around. It works if you encode the file with
> >>>> UTF-8 (for example), and do not include the BOM at the beginning.
> >>>> I use notepad++ for that task, where you can save in "UTF-8 without
> >>>> BOM".
> >>>>
> >>>> The process for that is easy:
> >>>> 1. open the file in notepad++
> >>>> 2. mark everything via CTRL-A
> >>>> 3. cut (not copy!)
> >>>> 4. in the format menu, choose "ANSI" formatting and select "UTF
> >>>> without BOM" at the bottom
> >>>> 5. paste
> >>>> 6. save.
> >>>>
> >>>> that is a crap workaround, but works for me. for automatically
> >>>> generated files ..... I dunno :-)
> >>>>
> >>>>
> >>>> Greetings,
> >>>> Axel.
> >>>>
> >>>>
> >>>> On 7/5/06, Matthew Brown < [EMAIL PROTECTED]
> >>>> <mailto:[EMAIL PROTECTED]> > wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I hate to do this, but can anyone please help me with either of
> >>>> these issues? I've tried to upgrade Xerces to 2.8.0 but to no
> >>>> avail.
> >>>>
> >>>> Is there anything else I could be doing?
> >>> Just wondering if your file in question starts with hex 'ef bb bf'
> >>> or 'ff fe' or 'fe ff'. If it is one of the latter two forms I
> >>> believe you have a utf-16 encoded file (little endian or big
> >>> endian) not utf-8. If it is the 'ef bb bf' sequence then it starts
> >>> correctly with the utf-8 encoded unicode code point for BOM U+FEFF.
> >>> In all cases xerces should be able to handle it. A problem may arise
> >>> if it starts with 'ff fe' but the XML prolog says encoding="utf-8"
> >>> as that is a contradiction I believe.
> >>>
> >>> I know this does not help directly but may help to check if the
> >>> problem is with the producer of the XML document or your consumer.
> >>>
> >>> Manuel
> >>>
> >>>> What about the possibility of programmatically editing/cleaning the
> >>>> response XML before it is given to the parser?
> >>>>
> >>>> Thanks
> >>>> Matt
> >>>>
> >>>> -----Original Message-----
> >>>> From: Matthew Brown [mailto: [EMAIL PROTECTED]
> >>>> <mailto:[EMAIL PROTECTED]> ]
> >>>> Sent: Saturday, July 01, 2006 12:41 PM
> >>>> To: [email protected] <mailto:[email protected]>
> >>>> Subject: Two questions - BOM in UTF-8, and manually cleaning XML
> >>>>
> >>>>
> >>>> 1. From searching the mailing list archives, I see several
> >>>> references to people having problems with Byte Order Mark
> >>>> characters appearing before the prolog in their UTF-8 messages.
> >>>> However I can't seem to find much of a known resolution to these
> >>>> issues. Is there a standard/common workaround for these BOM and
> >>>> UTF-8 issues?
> >>>>
> >>>> 2. If there is no answer to my #1, is there any way that Axis will
> >>>> allow me to programmatically edit the response XML before it is passed
> >>>> to the parser and de-serialized? I've tried adding Handlers, but
> >>>> I'm assuming that the Handler comes into the picture after the
> >>>> message is parsed, because my Handler is only ever seeing the
> >>>> request message, and not the response.
> >>>>
> >>>> Thanks
> >>>> Matt Brown
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >>> For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
>
> --
> -------------------------------------------------------------------
> GRIDSYSTEMS                    Rodrigo Ruiz Aguayo
> Parc Bit - Son Espanyol
> 07120 Palma de Mallorca        mailto:[EMAIL PROTECTED]
> Baleares - España              Tel:+34-971435085 Fax:+34-971435082
> http://www.gridsystems.com
> -------------------------------------------------------------------
>
>
>
>


-- 
Davanum Srinivas : http://www.wso2.net (Oxygen for Web Service Developers)
