Re: UTF8Encoder question...

Toshiyuki Kimura Wed, 19 Jan 2005 01:13:27 -0800

Hi Jongjin,

Let me clarify ...
Is the switch only for the Admin Service and Client, global to the app,
or per application?

From the i18n point of view, I hope Axis works fine at all times with
all languages by using the default settings.

Thanks,
Toshi <[EMAIL PROTECTED]>

On Wed, 19 Jan 2005, Jongjin Choi wrote:

> Hi, Toshi and all.
>
> I'd like to propose these for backward compatibility:
> - keep the escaping as default
> - make a runtime option (axis property in wsdd) for switching to
>   no-escaping.
>
> The current behavior is no problem for an application handling the
> soap message. I just pointed out that the message size can be somewhat
> larger with escaping.
>
> But in this case, the admin client (AdminClient.java) seems to write
> the content of the soap body directly to the console. I think the switch
> can be applied to Admin Service and Client.
>
> Any thought?
>
> /Jongjin
>
> ----- Original Message -----
> From: "Toshiyuki Kimura" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Cc: "Changshin Lee" <[EMAIL PROTECTED]>;
> "Jongjin Choi" <[EMAIL PROTECTED]>
> Sent: Wednesday, January 19, 2005 12:41 PM
> Subject: Re: UTF8Encoder question...
>
>
>> Hi Ias, Jongjin and all,
>>
>> Sorry for cutting in. I'd like to know the conclusion.
>>
>> As you may know, I'm now working on i18n of Axis. The
>> Japanese Axis Community has already made Japanized resources.
>> While testing, I faced an encoding problem with UTF-8.
>>
>> With the latest CVS code, I get an escaped message from the
>> server-side Axis as follows:
>>
>> <Admin>&#x51E6;&#x7406;&#x3092;&#x5B9F;&#x884C;&#x3057;&#x307E;
>> &#x3057;&#x305F;/ [en]-(Done processing)</Admin>
>>
>> instead of
>>
>> <Admin>[Japanese Message] / [en]-(Done processing)</Admin>
>>
>> As a side note, I got valid Japanese characters when I
>> applied Jongjin's patch to my local 'UTF8Encoder.java'.
>>
>> Any thought?
>>
>> Regards,
>> Toshi <[EMAIL PROTECTED]>
>>
>> On Thu, 30 Dec 2004, Changshin Lee wrote:
>>
>>>> Ias and all,
>>>>
>>>> If you revive the commented and removed code of UTF8Encoder, that is:
>>>> /*
>>>> TODO: Try fixing this block instead of code above.
>>>> if (character < 0x80) {
>>>>     writer.write(character);
>>>> } else if (character < 0x800) {
>>>>     writer.write((0xC0 | character >> 6));
>>>>     writer.write((0x80 | character & 0x3F));
>>>> } else if (character < 0x10000) {
>>>>     writer.write((0xE0 | character >> 12));
>>>>     writer.write((0x80 | character >> 6 & 0x3F));
>>>>     writer.write((0x80 | character & 0x3F));
>>>> } else if (character < 0x200000) {
>>>>     writer.write((0xF0 | character >> 18));
>>>>     writer.write((0x80 | character >> 12 & 0x3F));
>>>>     writer.write((0x80 | character >> 6 & 0x3F));
>>>>     writer.write((0x80 | character & 0x3F));
>>>> }
>>>> */
>>>> and comment out the current escaping code, the all-tests will fail.
>>>> As I said, this code would be necessary for an OutputStream, not a Writer.
>>>> In this case a Writer is used, and the code can simply be rewritten (as in UTF16Encoder) as
>>>>
>>>> writer.write(character);
>>>>
>>>> I think the all-tests will succeed. (I can't verify this now because the current CVS all-tests fail.)
>>>>
>>>
>>> Could you run all-tests except those that fail chronically (by adding
>>> them to the excluded list)? If the result is clean, I'm for the change (and
>>> it's easy to revert as well, so commit it :-).
>>>
>>>> As for readability of SOAP messages, I think it is not the responsibility of Axis.
>>>
>>> Human readability is one of the essences of XML (and SOAP). Assuming that
>>> a SOAP processor processes a SOAP input message readable to a user,
>>> then the output of the processing in the form of SOAP must be readable
>>> to the user. Therefore when people use Axis as a SOAP processor, they
>>> will blame Axis for a result containing characters that look unreadably
>>> broken to them. It's not entirely up to Axis, but Axis can cause it, and Axis
>>> should guarantee that there's no distortion in terms of readability
>>> from Alpha to Omega of SOAP processing.
>>>
>>> Ias
>>>
>>>>
>>>> This is the diff:
>>>> cvs diff -u UTF8Encoder.java
>>>> Index: UTF8Encoder.java
>>>> ===================================================================
>>>> RCS file: /home/cvspublic/ws-axis/java/src/org/apache/axis/components/encoding/UTF8Encoder.java,v
>>>> retrieving revision 1.4
>>>> diff -u -r1.4 UTF8Encoder.java
>>>> --- UTF8Encoder.java 4 Nov 2004 18:23:12 -0000 1.4
>>>> +++ UTF8Encoder.java 30 Dec 2004 01:20:03 -0000
>>>> @@ -82,10 +82,6 @@
>>>>                  "invalidXmlCharacter00",
>>>>                  Integer.toHexString(character),
>>>>                  xmlString));
>>>> -            } else if (character > 0x7F) {
>>>> -                writer.write("&#x");
>>>> -                writer.write(Integer.toHexString(character).toUpperCase());
>>>> -                writer.write(";");
>>>>              } else {
>>>>                  writer.write(character);
>>>>              }
>>>>
>>>>
>>>> /Jongjin
>>>>
>>>> ----- Original Message -----
>>>> From: "Changshin Lee" <[EMAIL PROTECTED]>
>>>> To: <[email protected]>
>>>> Sent: Thursday, December 30, 2004 1:20 AM
>>>> Subject: Re: UTF8Encoder question...
>>>>
>>>>> Ias,
>>>>>
>>>>> Even if we consider systems which can't display the soap message well for lack of a unicode font,
>>>>> I think the default encoding should be as-it-is, not escaping.
>>>>>
>>>>> The soap message is not for display, and it is better to generate the more compact soap message from the web services toolkit's point of view.
>>>>>
>>>>
>>>> SOAP messages are not for presentation but should be readable :-)
>>>>
>>>>> For displaying, the application can convert the soap message to an appropriate encoding. (As you know, here in Korea we use euc-kr,
>>>>> and also as you know, the conversion is possible with a few lines of Java code.)
>>>>> Also, as far as I know, Axis used the as-it-is way in Axis 1.0 or 1.1.
>>>>>
>>>>
>>>> That's a good point. However, we need to pay attention to those who may
>>>> want UTF8Encoder to run the conversion as it does now. If we revert Axis 1.2's
>>>> UTF8Encoder, we should inform users of the regression clearly in order
>>>> not to puzzle them.
>>>>
>>>>> I remember that the reason for using escaping in UTF8Encoder was to handle French accents or German umlauts a few months ago. This is reflected in the test.encoding.TestString test case.
>>>>>
>>>>
>>>> The current mechanism came up in April. At the moment
>>>>
>>>> /*
>>>> TODO: Try fixing this block instead of code above.
>>>> if (character < 0x80) {
>>>>     writer.write(character);
>>>> } else if (character < 0x800) {
>>>>     writer.write((0xC0 | character >> 6));
>>>>     writer.write((0x80 | character & 0x3F));
>>>> } else if (character < 0x10000) {
>>>>     writer.write((0xE0 | character >> 12));
>>>>     writer.write((0x80 | character >> 6 & 0x3F));
>>>>     writer.write((0x80 | character & 0x3F));
>>>> } else if (character < 0x200000) {
>>>>     writer.write((0xF0 | character >> 18));
>>>>     writer.write((0x80 | character >> 12 & 0x3F));
>>>>     writer.write((0x80 | character >> 6 & 0x3F));
>>>>     writer.write((0x80 | character & 0x3F));
>>>> }
>>>> */
>>>>
>>>> but the commented part was gone in the 1_2RC2 tag.
>>>>
>>>>> Any thought?
>>>>>
>>>>
>>>> So, what you're saying is that the current UTF8Encoder's behavior
>>>> comes from the test case. In other words, if you change the encoder to
>>>> output "as-it-is", then the test fails. Could we make them consistent,
>>>> I mean, UTF8Encoder outputs without conversion and at the same time
>>>> the case passes?
>>>>
>>>> Ias
>>>>
>>>> P.S. I'd like to hear opinions on changing UTF8Encoder's default
>>>> behavior (and possibly creating another encoder or an option for
>>>> conversion). Once we pass all tests with the changed encoder, it is
>>>> worth adopting the change, I believe.
>>>>
>>>>> /Jongjin
>>>>>
>>>>> ----- Original Message -----
>>>>> From: "Ias" <[EMAIL PROTECTED]>
>>>>> To: <[email protected]>
>>>>> Sent: Wednesday, December 29, 2004 1:53 AM
>>>>> Subject: RE: UTF8Encoder question...
>>>>>
>>>>>>
>>>>>> From: Jongjin Choi [mailto:[EMAIL PROTECTED]
>>>>>> Sent: Tuesday, December 28, 2004 11:56 AM
>>>>>> To: [email protected]
>>>>>> Subject: UTF8Encoder question...
>>>>>>
>>>>>>
>>>>>> Dims and all,
>>>>>>
>>>>>> UTF8Encoder writes an escaped string when the character is over 0x7F.
>>>>>> The escaping does not seem to be necessary because
>>>>>> the Writer (not OutputStream) is used.
>>>>>>
>>>>>> I think this could be just : (line 86)
>>>>>>
>>>>>> writer.write(character);
>>>>>>
>>>>>> instead of : (line 86 ~ 88)
>>>>>> writer.write("&#x");
>>>>>> writer.write(Integer.toHexString(character).toUpperCase());
>>>>>> writer.write(";");
>>>>>>
>>>>>> The escaping just increases the message size.
>>>>>>
>>>>> ias> Yes, it does. However, I think representing a character whose codepoint
>>>>> ias> is over 0x7F in the form of an &#x XML entity is one of the aims of the encoder,
>>>>> ias> because some systems can't display that character properly due to having no
>>>>> ias> unicode-wide fonts built in. In case it's 100% certain that every node
>>>>> ias> in a messaging system has no problem with "as-it-is" character
>>>>> ias> representation in an XML instance, it must be much more efficient to use a
>>>>> ias> compact encoder, as you pointed out, instead of UTF8Encoder. Interestingly,
>>>>> ias> AbstractXMLEncoder (which is not instantiable) works in such a way. In
>>>>> ias> consequence, it would be a good idea to create a new encoder to optimize
>>>>> ias> message size and make it easy to configure. (Yes, we can recommend
>>>>> ias> it to users dealing with non-Latin character systems :-)
>>>>>>
>>>>>> Happy new year,
>>>>>>
>>>>>> Ias
>>>>>>
>>>>>> P.S. I'm going to switch [EMAIL PROTECTED] to [EMAIL PROTECTED] (soon,
>>>>>> very soon).
>>>>>>
>>>>>>
>>>>>> If the OutputStream is used, the escaping or UTF-8 conversion (which
>>>>>> existed in the old UTF8Encoder.java) will be needed.
>>>>>>
>>>>>> Thought?
>>>>>>
>>>>>> /Jongjin
>>>>>>
>>>>>>
>>>>
>>>
>>
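
For anyone following the thread, the two behaviours being debated (escaping every character above 0x7F versus writing it "as-it-is" to a Writer) can be compared with a small standalone sketch. This is not the Axis source: the class `EscapeDemo`, its methods, and the `escapeNonAscii` flag are hypothetical names used only for illustration, and the flag merely stands in for the runtime option proposed above. It works char by char, like the snippets quoted in the thread, so it does not handle XML special characters or surrogate pairs.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class EscapeDemo {
    // Hypothetical helper (not the Axis API): copies text to a Writer and,
    // when escapeNonAscii is true, replaces each char above 0x7F with a
    // hexadecimal character reference, as the code quoted in the thread does.
    static void write(Writer writer, String text, boolean escapeNonAscii) throws IOException {
        for (int i = 0; i < text.length(); i++) {
            char character = text.charAt(i);
            if (escapeNonAscii && character > 0x7F) {
                writer.write("&#x");
                writer.write(Integer.toHexString(character).toUpperCase());
                writer.write(";");
            } else {
                // A Writer deals in characters; the byte-level UTF-8 encoding
                // is left to whatever turns the characters into bytes later.
                writer.write(character);
            }
        }
    }

    // Convenience wrapper that collects the output in a String.
    static String encode(String text, boolean escapeNonAscii) {
        StringWriter out = new StringWriter();
        try {
            write(out, text, escapeNonAscii);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String message = "\u51E6\u7406"; // two Japanese characters
        String escaped = encode(message, true);
        String asIs = encode(message, false);
        // Size of each form once serialized as UTF-8 bytes: 16 vs 6.
        System.out.println(escaped.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(asIs.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Both forms decode to the same text in a conforming XML parser; the difference is size and how the raw message looks on a console, which is the trade-off discussed above.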
