Re: i18n bug nailed down!

Santiago Gala Sat, 18 Nov 2000 03:42:19 -0800


Jon Stevens wrote:

> on 11/17/2000 4:32 PM, "Santiago Gala" <[EMAIL PROTECTED]> wrote:
>
> > I had to patch org.apache.ecs.GenericElement.java, which plainly did not know
> > how to convert multibyte characters back to a String.
> >
> > As I'm not involved in ECS, I put the patch here and I send this message to
> > the ecs list and Jon Stevens. It is not very clean, but the principle is:
> > never use a ByteArray to write characters to, since you will lose the high
> > byte. I have tested the changes, and I have found no problems.
> >
> > To check for the problem, start Jetspeed, wait until the feeds are processed,
> > and search for "javable". There should be two channels, one in English and one
> > in Russian. The Russian one should be filled with "?".
> >
> > After the patch, it should display plenty of Cyrillic characters.
>
> Actually no, I think that you are wrong.
>
> If you pass the correct encoding to the toString() method of a
> ByteArrayOutputStream, the Javadoc clearly states that it does the
> translation between bytes and characters. ECS is clearly doing this
> correctly. It appears as though you are not though.
>

That is true for reading a ByteStream, but the problem is when ECS writes it.

Let's imagine that you have a StringElement with high byte characters.

When you call toString() or output(), what ECS does is:

It creates a ByteArrayOutputStream, and calls output(bos) on this stream.

So, let's analyze this routine:

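It takes no encoding as a parameter, so we are already going wrong here. Do you see
that? If we are trying to encode as bytes, we should pass the encoding, and use
getBytes(enc) all the way. In fact, the no-argument String.getBytes() depends on the
platform default encoding, and the deprecated getBytes(int, int, byte[], int) simply
drops the high byte of each character.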

This routine will call getBytes() for some Strings with no encoding. Wrong again.
We should use whatever encoding is needed to preserve high bytes.

What is even worse, at the end of this process, it will reconstruct the String,
from the ByteArray, putting zeros in the high order byte. So it is a nonsense trip
to nowhere.
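
To make the round trip concrete, here is a minimal stand-alone sketch. It does not use ECS at all: the class name HighByteDemo and its three helper methods are made up for illustration, and StandardCharsets assumes a modern JDK. It only shows what happens to U+0432 when characters are pushed through a byte array in three different ways.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of ECS.
class HighByteDemo {

    /** Writes each char as one byte, then rebuilds a String: the high byte is gone. */
    static String truncatingRoundTrip(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            bos.write(s.charAt(i)); // write(int) keeps only the low 8 bits
        }
        return new String(bos.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    /** Encodes with a charset that has no Cyrillic: the character is replaced by '?'. */
    static String lossyRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes and decodes with the same multibyte-capable encoding: nothing is lost. */
    static String utf8RoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String ve = "\u0432"; // the same character used in the test program
        System.out.println(truncatingRoundTrip(ve));        // prints "2" (0x0432 -> 0x32)
        System.out.println(lossyRoundTrip(ve));             // prints "?"
        System.out.println(utf8RoundTrip(ve).equals(ve));   // prints "true"
    }
}
```

The second case is the one that matches the "?" symptom seen in the Russian channel; which of the lossy paths ECS actually takes is not something this sketch claims to know.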

I think the problem stems from a confusion between getBytes() and getString(). If
you change the method name to toBytes() and toBytes(encoding), add the encoding
as a parameter to output(ByteArrayOutputStream), and use it, it will probably do
what it is trying to do.

My version (since we are getting a String starting from objects that are
collections of Strings) is to use Strings all the way. So, I create a
StringWriter, wrap it in a PrintWriter, and output on this Writer. No encodings are
involved, since we are going from String to String. It should also be much
faster.
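
The idea can be sketched roughly as follows. This is not the actual patch to GenericElement: SketchElement, its add() method, and its fields are hypothetical stand-ins for an element that holds text and child elements. The point is only that output goes to a Writer, so no byte array is ever involved.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an element tree; not the ECS classes.
class SketchElement {
    private final String text;
    private final List<SketchElement> children = new ArrayList<>();

    SketchElement(String text) {
        this.text = text;
    }

    SketchElement add(SketchElement child) {
        children.add(child);
        return this;
    }

    /** Writes this element and its children as characters; no bytes, no encoding. */
    void output(PrintWriter out) {
        out.print(text);
        for (SketchElement child : children) {
            child.output(out);
        }
    }

    @Override
    public String toString() {
        StringWriter buffer = new StringWriter(); // backed by a StringBuffer
        PrintWriter out = new PrintWriter(buffer);
        output(out);
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        SketchElement root = new SketchElement("<p>")
                .add(new SketchElement("\u0432"))
                .add(new SketchElement("</p>"));
        System.out.println(root.toString().indexOf('\u0432') >= 0); // prints "true"
    }
}
```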

When you call toString(encoding), the theory is good. But what happens is:

- toString(encoding) calls output(ByteArrayOutputStream), again losing the
encoding. This is the piece that is screwing up the whole thing. Even if the
implementation of the top-level code is right in principle, it will lose the
encoding as soon as an Element encapsulates another.
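
A small sketch of that failure mode, under stated assumptions: NestedEncodingLoss and its two methods are invented for illustration, and ISO-8859-1 stands in for whatever default the inner call ends up using. The outer method knows the encoding, but the inner one has no parameter to receive it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; method names only mirror the ones discussed above.
class NestedEncodingLoss {
    // Assumption: the encoding the inner step falls back to.
    static final Charset ASSUMED_DEFAULT = StandardCharsets.ISO_8859_1;

    /** Inner step: no encoding parameter, so it can only use the default. */
    static void output(String text, ByteArrayOutputStream bos) {
        byte[] bytes = text.getBytes(ASSUMED_DEFAULT);
        bos.write(bytes, 0, bytes.length);
    }

    /** Outer step: receives the encoding, but can apply it only when decoding. */
    static String renderToString(String text, Charset encoding) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        output(text, bos); // the encoding is not passed down
        return new String(bos.toByteArray(), encoding);
    }

    public static void main(String[] args) {
        String result = renderToString("\u0432", StandardCharsets.UTF_8);
        System.out.println(result); // prints "?"
    }
}
```

Decoding with the right encoding at the end cannot help, because the character was already replaced when the bytes were produced.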


>
> Given that people have been using ECS for quite some time and haven't
> complained so far of I18N issues within ECS and I have heard of MANY
> people using Turbine with I18N without any problems and Turbine's core
> revolves around I18N (Jyve is another example of a fully I18N localized
> application based 100% on top of ECS), I think that this is clearly a
> problem on your end.

Are people using it with multibyte encodings (I'm using utf-8)? I would bet that if
anybody is using it in Japan, they have had to patch it in similar ways. I
would be happy if some people come and say that they are using it in Japanese or
Chinese, with ECS, and that they have no problems.

I have been fighting with this bug for over a month (not full time, but it kept
itching me :-)). I have put in lines looking for a specific Russian character before
and after storing the channel in ECS. Before, it works; after, it does not.

Try something along the lines of:


import org.apache.ecs.*;

/**
 * Test for multibyte character handling in ECS.
 * @author <A HREF="mailto:[EMAIL PROTECTED]">S. Gala</A>
 */
public class TestECSPrinting {

    public static void main( String[] argv ) {

        // U+0432, CYRILLIC SMALL LETTER VE (looks like a small "B")
        String test = "" + (char)0x0432;
        StringElement testecs = new StringElement(test);

        // Sanity check on the plain String: this one always prints.
        if( test.indexOf( test ) >= 0 )
            System.err.println("It contains SMALL RUSSIAN B before");
        // Prints nothing if toString() drops the high byte.
        if( testecs.toString().indexOf( test ) >= 0 )
            System.err.println("It contains SMALL RUSSIAN B after toString()");
        // Prints nothing if the encoding is lost on the way through output().
        if( testecs.toString("utf-8").indexOf( test ) >= 0 )
            System.err.println("It contains SMALL RUSSIAN B after toString(\"utf-8\")");
    }

}

Try it with and without the patch I sent you. Clean it up as you prefer and solve
the issue.


>
>
> Also, your solution is a bad hack as you don't even wrap things in Buffered
> streams nor does your toString(encoding) method actually use the encoding
> parameter. Huh? You are actually removing functionality from the code.
>

No. I'm simplifying the design. I don't use Buffered streams since StringWriter
already writes to a StringBuffer. I don't use the encoding because it is not needed.
We already have Java Strings (i.e. 16-bit characters), and we are not really encoding
or decoding them, just concatenating them.

If you really want to clean up the design, what is needed is:

- The toString() method should be clean, like it is now.
- output(Writer) will be all right once toString() is clean.
- output(ByteArrayOutputStream) needs to call getBytes(encoding) on every String,
depending on the version, to write to the ByteArrayOutputStream.
- Maybe a getBytes(encoding) would be a useful method. See the discussion above.
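
One possible shape for the byte side of that list, as a sketch only: EncodedOutputSketch and its constructor are hypothetical and not the ECS API, and the byte-oriented output takes a plain OutputStream plus the encoding name. It gets the effect of calling getBytes(encoding) on every String by wrapping the stream in an OutputStreamWriter once and reusing the character-based output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical element made of several Strings; not the ECS classes.
class EncodedOutputSketch {
    private final String[] parts;

    EncodedOutputSketch(String... parts) {
        this.parts = parts;
    }

    /** Character output: Strings are written as they are. */
    void output(Writer out) throws IOException {
        for (String part : parts) {
            out.write(part);
        }
    }

    /** Byte output: the caller's encoding is used for every String written. */
    void output(OutputStream out, String encoding) throws IOException {
        Writer writer = new OutputStreamWriter(out, encoding);
        output(writer);
        writer.flush(); // push any buffered characters through the encoder
    }

    /** Convenience method along the lines of the suggested getBytes(encoding). */
    byte[] getBytes(String encoding) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        output(bos, encoding);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = new EncodedOutputSketch("a", "\u0432").getBytes("UTF-8");
        System.out.println(bytes.length); // prints "3": one byte for 'a', two for U+0432
    }
}
```

Nested elements would call output(Writer) on their children, so the encoding is chosen exactly once, at the point where characters become bytes.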

>
> So, I'm -1 on this patch.
>

Would you reconsider your position after running the tests? The tests look small
and clean enough to at least point to a clear failure in ECS.

I can send you an attachment with ecs-1.3.3-patched.jar and this small test java
program. The patched jar contains both the source and the .class for
GenericElement.java. But I think it is easy enough to apply the patch and try.

Regards

>
> thanks,
>
> -jon
>
> --
> twice of not very much is still a lot more than not very much

A badly cleaned up patch is better than living with a BIG bug. ;-)

P.S.) According to Linus, sometimes it is more difficult to find a problem than to
solve a problem. I'm not very good at "polishing" code, but I believe I'm good at
pointing to problems.

P.P.S.) I thought quite a few times that the problems were in Turbine, but it was
not true. I'm truly amazed at how clean the Turbine code is. All the Turbine team
are doing a very impressive job. In a more humorous vein, I was desperately trying
to find a bug in your code, to be able to say "So you're not so perfect, uh?". I
was only able to find it in a deprecated part of Turbine, and that shows that
Turbine has a very good code base. I know this is childish, but that is how we
improve the quality of the code ;-)

>
>
> --
> --------------------------------------------------------------
> Please read the FAQ! <http://java.apache.org/faq/>
> To subscribe:        [EMAIL PROTECTED]
> To unsubscribe:      [EMAIL PROTECTED]
> Archives and Other:  <http://marc.theaimsgroup.com/?l=jetspeed>
> Problems?:           [EMAIL PROTECTED]


