Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

Masayoshi Okutsu Fri, 10 Feb 2012 00:32:49 -0800

I tend to agree with Sherman that the real problem is the OutputStreamWriter API which isn't good enough to handle various encodings. My understanding is that the charset API was introduced in 1.4 to deal with the limitations of the java.io and other encoding handling issues.

I still don't think it's correct to change the flush semantics. The flush method should just flush out any outstanding data to the given output stream as defined in java.io.Writer. What if Writer.write(int) is used to write UTF-16 data with any stateful encoding? Suppose the stateful encoding can support supplementary characters which require other G0 designations, does the following calling sequence work?


writer.write(highSurrogate);
writer.flush();
writer.write(lowSurrogate);
writer.flush();

Of course, this isn't a problem with iso-2022-jp, though.

I think it's a correct fix, not a workaround, to create a filter stream to deal with stateful encodings with the java.io API. If it's OK to support only 1.4 and later, the java.nio.charset API should be used.
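
As a rough illustration of the java.nio.charset route, the sketch below drives a CharsetEncoder directly and calls its flush() method explicitly, so the caller decides when the shift state is written out. The class and method names (EncoderFlushSketch, encodeAndFlush) are made up for this example, and it assumes the runtime ships the ISO-2022-JP charset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public class EncoderFlushSketch {

    /** Hypothetical helper: encodes text, then explicitly flushes the encoder state. */
    static byte[] encodeAndFlush(String text, String charsetName)
            throws CharacterCodingException {
        CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
        CharBuffer in = CharBuffer.wrap(text);
        // Generous size: per-char maximum plus room for a trailing escape sequence.
        ByteBuffer out = ByteBuffer.allocate(
                (int) (text.length() * enc.maxBytesPerChar()) + 16);

        // endOfInput = true: no more characters will follow.
        CoderResult cr = enc.encode(in, out, true);
        if (!cr.isUnderflow()) {
            cr.throwException();
        }
        // The explicit step that OutputStreamWriter does not expose:
        // write out any pending state, e.g. the switch back to ASCII.
        cr = enc.flush(out);
        if (!cr.isUnderflow()) {
            cr.throwException();
        }

        out.flip();
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (byte b : encodeAndFlush("\u6700", "ISO-2022-JP")) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(sb.toString().trim());
    }
}
```

For the single character \u6700 this should produce the same eight bytes discussed later in the thread: the escape into Japanese mode, the two-byte character, and the escape back to ASCII.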


Thanks,
Masayoshi

On 2/10/2012 4:12 AM, Xueming Shen wrote:

CCed Bill Shannon.

On 02/09/2012 11:10 AM, Xueming Shen wrote:
CharsetEncoder has the "flush()" method as the last step (of a series of "encoding" steps) to flush out any internal state to the output buffer. The issue here is that the upper level wrapper class, OutputStreamWriter in our case, doesn't provide an "explicit" mechanism to let the user request a "flush" on the underlying encoder. The only "guaranteed" mechanism is the "close()" method, which appears not to be appropriate to invoke in some use scenarios, such as the JavaMail.writeTo() case.
It appears we lack a "close this stream, but not the underlying stream" mechanism in our layered/chained streams. I have similar requests for this kind of mechanism in other areas, such as the zip/gzip streams: an app wraps an "outputstream" with zip/gzip and wants to release the zip/gzip layer after use (to release the native resource, for example) but keep the underlying stream unclosed. The only workaround now is to wrap the underlying stream with a subclass that overrides the "close()" method, which is really not desirable.
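
The subclass workaround described here could look roughly like the sketch below. NonClosingOutputStream, KeepOpenSketch and writeJapanese are hypothetical names invented for this illustration, not JDK or JavaMail classes, and the example assumes ISO-2022-JP is available in the runtime.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class KeepOpenSketch {

    /** Hypothetical wrapper: close() flushes but leaves the wrapped stream open. */
    static final class NonClosingOutputStream extends FilterOutputStream {
        NonClosingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            // Bypass FilterOutputStream's byte-at-a-time default.
            out.write(b, off, len);
        }

        @Override
        public void close() throws IOException {
            // Deliberately do NOT call out.close().
            flush();
        }
    }

    /** Writes one Japanese character and closes only the writer layer. */
    static void writeJapanese(OutputStream underlying) throws IOException {
        try (Writer w = new OutputStreamWriter(
                new NonClosingOutputStream(underlying), "ISO-2022-JP")) {
            w.write("\u6700");
        } // Writer.close() flushes the encoder, so the escape back to ASCII is written.
        // 'underlying' is still open here and can be used for further writing.
    }
}
```

Closing the writer is what triggers the encoder flush, so the caller gets the trailing escape sequence without giving up the underlying stream. The same wrapper shape would apply to a GZIPOutputStream layered over a stream that must stay open.
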
The OutputStreamWriter.flush() API doc does not explicitly specify whether it should actually flush the underlying charset encoder (so I would not argue strongly that this IS a SE bug), but given that it flushes its buffer (internal status) and then the underlying "out" stream, it's reasonable to consider that the "internal status" of its encoder also needs to be flushed.
Especially since this has been the behavior for releases earlier than 1.4.2.
As I said, I have been hesitant to "fix" this problem for a while (one, it has been here for 3 major releases; two, the API does not explicitly say so), but as long as we don't have a reasonable "close-ME-only" mechanism for those layered streams, it appears to be a reasonable approach to solve the problem, without having obvious negative impact.
-Sherman
PS: There is another implementation "detail": the original iso-2022-jp c2b converter actually restores the state back to ASCII mode at the end of its "convert" method. This makes the analysis a little complicated, but should not change the issue we are discussing.
On 02/09/2012 12:26 AM, Masayoshi Okutsu wrote:
First of all, is this really a Java SE bug? The usage of OutputStreamWriter in JavaMail seems to be wrong to me. The writeTo method in the bug report doesn't seem to be able to deal with any stateful encodings.
Masayoshi

On 2/9/2012 3:26 PM, Xueming Shen wrote:
Hi
This is a long-standing "regression" from 1.3.1 in how OutputStreamWriter.flush()/flushBuffer() handles the escape or shift sequences in some charsets/encodings, for example ISO-2022-JP.
ISO-2022-JP is an encoding that starts in ASCII mode and then switches between ASCII and Japanese characters through escape sequences. For example, the escape sequence ESC $ B (0x1b 0x24 0x42) is used to indicate that the following bytes are Japanese (switch from ASCII mode to Japanese mode), and ESC ( B (0x1b 0x28 0x42) is used to switch back to ASCII.
Java's sun.io.CharToByteConverter (the old generation charset converter) and java.nio.charset.CharsetEncoder usually switch back and forth between ASCII and Japanese modes based on the input character sequence (for example, if you are in ASCII mode and your next input character is Japanese, you add ESC $ B to the output first, followed by the converted input character; or if you are in Japanese mode and your next input is ASCII, you output ESC ( B first to switch the mode and then the ASCII character), and switch back to ASCII mode (if the last mode is non-ASCII) if either the encoding is ending or the flush() method gets invoked.
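
To make the mode switching concrete, the sketch below encodes a mixed ASCII/Japanese string in one shot and prints the bytes in hex. ModeSwitchSketch and its helpers are illustrative names only, and the example assumes ISO-2022-JP is available in the runtime.

```java
import java.nio.charset.Charset;

public class ModeSwitchSketch {

    /** Formats bytes as space-separated lowercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Encodes the whole string at once; String.getBytes completes the encoding. */
    static String encodeHex(String s) {
        return hex(s.getBytes(Charset.forName("ISO-2022-JP")));
    }

    public static void main(String[] args) {
        // ASCII 'a', then a Japanese character, then ASCII 'b'.
        System.out.println(encodeHex("a\u6700b"));
    }
}
```

The output should show 0x61 for 'a', the ESC $ B sequence before the two-byte Japanese character, and the ESC ( B sequence before 0x62 for 'b'.
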
In JDK 1.3.1, OutputStreamWriter.flushBuffer() explicitly invokes sun.io c2b's flushAny() to switch back to ASCII mode every time flush() or flushBuffer() (from PrintStream) gets invoked, as shown at the end of this email. For example, the code below uses OutputStreamWriter to "write" a Japanese character \u6700 to the underlying stream with iso-2022-jp:
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    String str = "\u6700";
    OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");
    osw.write(str, 0, str.length());
Since iso-2022-jp starts in ASCII mode and we now have a Japanese character, we need to switch into Japanese mode first (the first 3 bytes) and then write the encoded Japanese character (the following 2 bytes):

0x1b 0x24 0x42 0x3a 0x47

and then the code invokes

        osw.flush();

Since we are now in Japanese mode, the writer continues to write out

0x1b 0x28 0x42

to switch back to ASCII mode. The total output is 8 bytes after write() and flush().
However, when all encoding/charset related code was migrated from 1.3.1's sun.io based implementation to 1.4's java.nio.charset based implementation (we gradually migrated from sun.io to java.nio.charset over 1.4, 1.4.1 and 1.4.2), the "c2b.flushAny()" invocation obviously was dropped in sun.nio.cs.StreamEncoder. As a result, the "switch back to ASCII mode" sequence is no longer output when OutputStreamWriter.flush() or PrintStream.write(String) is invoked.
This does not cause a problem in most use scenarios, as long as the "stream" finally gets closed (in which case the StreamEncoder does invoke the encoder's flush() to output the escape sequence to switch back to ASCII) or PrintStream.println(String) is used (in which case it outputs a \n character; since this \n is in the ASCII range, it "accidentally" switches the mode back to ASCII).
But it obviously causes a problem when you can't close the OutputStreamWriter after you're done with your iso-2022-jp writing (for example, you need to continue to use the underlying OutputStream for other writing, but not "this" osw). On 1.3.1, these apps invoke osw.flush() to force the output to switch back to ASCII; this no longer works since we switched to java.nio.charset in jdk1.4.2 (we migrated iso-2022-jp to nio.charset in 1.4.2). This is what happened in JavaMail, as described in the bug report.
The solution is to restore the "flush the encoder" mechanism in StreamEncoder's flushBuffer().
I have been hesitant to make this change for a while, mostly because this regressed behavior has been there for 3 releases, and the change triggers yet another "behavior change". But given there is no obvious workaround and it only changes the behavior of the charsets with this shift in/out mechanism, mainly the iso-2022 family and those IBM EBCDIC_DBCS charsets, I decided to give it a try.

Here is the webreview

http://cr.openjdk.java.net/~sherman/6995537/webrev

Sherman
--------------------------------- 1.3.1 OutputStreamWriter -----------------------
    /**
     * Flush the output buffer to the underlying byte stream, without flushing
     * the byte stream itself. This method is non-private only so that it may
     * be invoked by PrintStream.
     */
    void flushBuffer() throws IOException {
        synchronized (lock) {
            ensureOpen();

            for (;;) {
                try {
                    nextByte += ctb.flushAny(bb, nextByte, nBytes);
                }
                catch (ConversionBufferFullException x) {
                    nextByte = ctb.nextByteIndex();
                }
                if (nextByte == 0)
                    break;
                if (nextByte > 0) {
                    out.write(bb, 0, nextByte);
                    nextByte = 0;
                }
            }
        }
    }

    /**
     * Flush the stream.
     *
     * @exception  IOException  If an I/O error occurs
     */
    public void flush() throws IOException {
        synchronized (lock) {
            flushBuffer();
            out.flush();
        }
    }
