Re: Codereview request for 6995537: different behavior in iso-2022-jp encoding between jdk131/140/141 and jdk142/5/6/7

Bill Shannon Fri, 10 Feb 2012 10:57:10 -0800

If you flush the stream while in the middle of writing a "character",
I would expect the results to be undefined, just as if you closed the
stream at that point.  But at the end of a consistent set of data, I
would expect flush to behave like close, but without the actual closing
of the stream.


Masayoshi Okutsu wrote on 02/10/12 00:31:

I tend to agree with Sherman that the real problem is the OutputStreamWriter API
which isn't good enough to handle various encodings. My understanding is that
the charset API was introduced in 1.4 to deal with the limitations of the
java.io and other encoding handling issues.

I still don't think it's correct to change the flush semantics. The flush method
should just flush out any outstanding data to the given output stream as defined
in java.io.Writer. What if Writer.write(int) is used to write UTF-16 data with a
stateful encoding? If the stateful encoding supports supplementary characters
that require other G0 designations, does the following calling sequence work?

writer.write(highSurrogate);
writer.flush();
writer.write(lowSurrogate);
writer.flush();

Of course, this isn't a problem with iso-2022-jp, though.

I think it's a correct fix, not a workaround, to create a filter stream to deal
with stateful encodings with the java.io API. If it's OK to support only 1.4 and
later, the java.nio.charset API should be used.
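
As an illustration of the filter idea, here is a minimal sketch of a Writer
that buffers characters and encodes them as one complete unit on each flush(),
so that a stateful charset ends every flushed chunk in its initial (ASCII)
state. The class name ResettingWriter and its demo() method are hypothetical;
nothing like it exists in the JDK or in JavaMail. It assumes flush() is only
called at character boundaries: a high surrogate flushed on its own would be
replaced rather than carried over to the next write.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of the JDK or JavaMail.
final class ResettingWriter extends Writer {
    private final OutputStream out;
    private final Charset charset;
    private final StringBuilder pending = new StringBuilder();

    ResettingWriter(OutputStream out, String charsetName) {
        this.out = out;
        this.charset = Charset.forName(charsetName);
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        synchronized (lock) {
            pending.append(cbuf, off, len);   // nothing is encoded until flush()
        }
    }

    @Override
    public void flush() throws IOException {
        synchronized (lock) {
            if (pending.length() > 0) {
                // String.getBytes(Charset) performs a complete encoding operation,
                // including the encoder's final flush, so the chunk ends in ASCII mode.
                out.write(pending.toString().getBytes(charset));
                pending.setLength(0);
            }
            out.flush();
        }
    }

    @Override
    public void close() throws IOException {
        flush();
        out.close();
    }

    // Writes one Japanese character, flushes, then reuses the raw stream for ASCII.
    static byte[] demo() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try {
            Writer w = new ResettingWriter(bos, "ISO-2022-JP");
            w.write("\u6700");
            w.flush();          // the writer is deliberately left open
            bos.write('A');     // safe: the stream is back in ASCII mode
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

The trade-off in this sketch is memory: everything written between two
flush() calls is held in the StringBuilder before it is encoded.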

Thanks,
Masayoshi

On 2/10/2012 4:12 AM, Xueming Shen wrote:

CCed Bill Shannon.

On 02/09/2012 11:10 AM, Xueming Shen wrote:


CharsetEncoder has the flush() method as the last step (of a series of
"encoding" steps) to flush out any internal state to the output buffer. The
issue here is that the upper-level wrapper class, OutputStreamWriter in our
case, doesn't provide an explicit mechanism to let the user request a "flush"
on the underlying encoder. The only guaranteed mechanism is the close()
method, which is not appropriate to invoke in some usage scenarios, such as
the JavaMail.writeTo() case.
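
The encode-then-flush sequence on CharsetEncoder can be sketched as follows.
The class and method names (EncoderFlushDemo, encodeThenFlush) are made up for
this illustration; the calls on CharsetEncoder are the standard
java.nio.charset ones. The method reports how many bytes are in the output
buffer after encode() and again after flush(), which makes the bytes held back
as encoder state visible.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical demo class; only the CharsetEncoder calls are real JDK API.
final class EncoderFlushDemo {
    /** Returns { bytes after encode(), bytes after flush() } for the given text. */
    static int[] encodeThenFlush(String text, String charsetName) {
        CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
        // Generous buffer: worst case per char plus room for a trailing escape.
        int capacity = (int) Math.ceil(text.length() * enc.maxBytesPerChar()) + 16;
        ByteBuffer out = ByteBuffer.allocate(capacity);

        // Step 1: encode all input, telling the encoder that no more input follows.
        CoderResult r = enc.encode(CharBuffer.wrap(text), out, true);
        if (!r.isUnderflow()) {
            throw new IllegalStateException("encode failed: " + r);
        }
        int afterEncode = out.position();

        // Step 2: flush writes whatever is needed to return to the initial state.
        r = enc.flush(out);
        if (!r.isUnderflow()) {
            throw new IllegalStateException("flush failed: " + r);
        }
        return new int[] { afterEncode, out.position() };
    }
}
```

For "\u6700" in ISO-2022-JP the difference between the two counts is the
3-byte escape sequence that switches back to ASCII.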

It appears we lack a "close this stream, but not the underlying stream"
mechanism in our layered/chained streams. I have had similar requests for this
kind of mechanism in other areas, such as the zip/gzip streams: an app wraps an
OutputStream with zip/gzip and wants to release the zip/gzip layer after use
(to release the native resources, for example) but keep the underlying stream
open. The only workaround now is to wrap the underlying stream with a subclass
that overrides the close() method, which is really not desirable.
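
For reference, that workaround might look like the sketch below. The class
name KeepOpenOutputStream and its demo() method are hypothetical. The wrapper
turns close() into a plain flush(), so closing the OutputStreamWriter flushes
the encoder (emitting the switch back to ASCII) without closing the stream
underneath.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical wrapper for illustration; not a JDK class.
final class KeepOpenOutputStream extends FilterOutputStream {
    KeepOpenOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);   // bypass FilterOutputStream's byte-at-a-time default
    }

    @Override
    public void close() throws IOException {
        flush();                  // push bytes down, but leave the wrapped stream open
    }

    // Closes the writer layer only, then keeps using the underlying stream.
    static byte[] demo() {
        final boolean[] closed = { false };
        ByteArrayOutputStream bos = new ByteArrayOutputStream() {
            @Override
            public void close() {
                closed[0] = true; // records whether anyone closed the real stream
            }
        };
        try {
            Writer w = new OutputStreamWriter(new KeepOpenOutputStream(bos), "ISO-2022-JP");
            w.write("\u6700");
            w.close();            // flushes the encoder; bos must remain open
            if (closed[0]) {
                throw new IllegalStateException("underlying stream was closed");
            }
            bos.write('A');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```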

The API doc of OutputStreamWriter.flush() does not explicitly specify whether
it should actually flush the underlying charset encoder (so I would not argue
strongly that this IS an SE bug), but given that it flushes its buffer
(internal state) and then the underlying "out" stream, it's reasonable to
consider that the internal state of its encoder also needs to be flushed,
especially since this was the behavior in releases earlier than 1.4.2.

As I said, I have hesitated to "fix" this problem for a while (one, it has
been here for 3 major releases; two, the API does not explicitly say so), but
as long as we don't have a reasonable "close-ME-only" mechanism for those
layered streams, this appears to be a reasonable approach to solve the
problem, without any obvious negative impact.

-Sherman

PS: There is another implementation "detail": the original iso-2022-jp c2b
converter actually restores the state back to ASCII mode at the end of its
"convert" method. This makes the analysis a little complicated, but should not
change the issue we are discussing.


On 02/09/2012 12:26 AM, Masayoshi Okutsu wrote:

First of all, is this really a Java SE bug? The usage of OutputStreamWriter
in JavaMail seems to be wrong to me. The writeTo method in the bug report
doesn't seem to be able to deal with any stateful encodings.

Masayoshi

On 2/9/2012 3:26 PM, Xueming Shen wrote:

Hi

This is a long-standing "regression" from 1.3.1 in how
OutputStreamWriter.flush()/flushBuffer() handles escape or shift sequences in
some charsets/encodings, for example ISO-2022-JP.

ISO-2022-JP is an encoding that starts in ASCII mode and then switches
between ASCII and Japanese characters through escape sequences. For example,
the escape sequence ESC $ B (0x1b 0x24 0x42) indicates that the following
bytes are Japanese (switch from ASCII mode to Japanese mode), and ESC ( B
(0x1b 0x28 0x42) is used to switch back to ASCII.
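
To see both escape sequences in one byte stream, a small sketch like the
following can dump the ISO-2022-JP encoding of a mixed string. The class name
Iso2022JpDump and its hex() method are made up for this illustration;
String.getBytes(Charset) is the standard call and performs a complete encoding
operation.

```java
import java.nio.charset.Charset;

// Hypothetical demo class for illustration only.
final class Iso2022JpDump {
    /** Returns the ISO-2022-JP bytes of the text as space-separated hex pairs. */
    static String hex(String text) {
        byte[] bytes = text.getBytes(Charset.forName("ISO-2022-JP"));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Calling Iso2022JpDump.hex("A\u6700B") gives

41 1b 24 42 3a 47 1b 28 42 42

that is: 'A' in ASCII mode, ESC $ B, the two-byte Japanese character, ESC ( B,
and then 'B' back in ASCII mode.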

Java's sun.io.CharToByteConverter (the old-generation charset converter) and
java.nio.charset.CharsetEncoder usually switch back and forth between ASCII
and Japanese modes based on the input character sequence. For example, if you
are in ASCII mode and your next input character is Japanese, you add ESC $ B
to the output first, followed by the converted input character; if you are in
Japanese mode and your next input is ASCII, you output ESC ( B first to switch
the mode and then the ASCII character. They switch back to ASCII mode (if the
last mode is non-ASCII) when either the encoding is ending or the flush()
method gets invoked.

In JDK 1.3.1, OutputStreamWriter.flushBuffer() explicitly invokes sun.io c2b's
flushAny() to switch back to ASCII mode every time flush() or flushBuffer()
(from PrintStream) gets invoked, as shown at the end of this email. For
example, the code below uses OutputStreamWriter to "write" a Japanese
character \u6700 to the underlying stream with iso-2022-jp:

ByteArrayOutputStream bos = new ByteArrayOutputStream();
String str = "\u6700";
OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");
osw.write(str, 0, str.length());

Since iso-2022-jp starts in ASCII mode and we now have a Japanese character,
we need to switch into Japanese mode first (the first 3 bytes) and then write
the encoded Japanese character (the following 2 bytes):

0x1b 0x24 0x42 0x3a 0x47

and then the code invokes

osw.flush();

since we are now in Japanese, the writer continues to write out

0x1b 0x28 0x42

to switch back to ASCII mode. The total output is 8 bytes after write() and
flush().
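
The byte counts can be checked on any given JDK with a sketch like the one
below, which completes the snippet above into a runnable measurement. The
class name FlushVersusClose and its measure() method are hypothetical. No
particular count after flush() is assumed here, because that is exactly the
behavior that differs between releases; only the count after close() is
expected to be the full 8 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;

// Hypothetical measurement harness for illustration only.
final class FlushVersusClose {
    /** Returns the stream size { after write(), after flush(), after close() }. */
    static int[] measure() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try {
            String str = "\u6700";
            OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");

            osw.write(str, 0, str.length());
            int afterWrite = bos.size();   // may be 0: the writer buffers internally

            osw.flush();
            int afterFlush = bos.size();   // 8 if flush() also flushes the encoder, else 5

            osw.close();
            int afterClose = bos.size();   // close() always flushes the encoder

            return new int[] { afterWrite, afterFlush, afterClose };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A result whose second element is 5 means the "switch back to ASCII" sequence
was only written at close().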

However, when all encoding/charset-related code was migrated from 1.3.1's
sun.io-based implementation to 1.4's java.nio.charset-based implementation
(we migrated gradually from sun.io to java.nio.charset over 1.4, 1.4.1 and
1.4.2), the c2b.flushAny() invocation was apparently dropped in
sun.nio.cs.StreamEncoder. As a result, the "switch back to ASCII mode"
sequence is no longer output when OutputStreamWriter.flush() or
PrintStream.write(String) is invoked.

This does not cause a problem in most usage scenarios, either because the
"stream" finally gets closed (in which case StreamEncoder does invoke the
encoder's flush() to output the escape sequence that switches back to ASCII)
or because PrintStream.println(String) is used (it outputs a \n character,
and since \n is in the ASCII range, it "accidentally" switches the mode back
to ASCII).

But it obviously causes a problem when you cannot close the
OutputStreamWriter after you're done with your iso-2022-jp writing (for
example, you need to continue using the underlying OutputStream for other
writing, but not "this" osw). With 1.3.1, these apps invoke osw.flush() to
force the output to switch back to ASCII; this no longer works since we
switched to java.nio.charset (we migrated iso-2022-jp to nio.charset in
1.4.2). This is what happened in JavaMail, as described in the bug report.

The solution is to restore the "flush the encoder" mechanism in
StreamEncoder's flushBuffer().

I have hesitated to make this change for a while, mostly because this
regressed behavior has been there for 3 releases, and the change triggers yet
another "behavior change". But given that there is no obvious workaround and
it only changes the behavior of the charsets with this shift in/out
mechanism, mainly the iso-2022 family and those IBM EBCDIC_DBCS charsets, I
decided to give it a try.

Here is the webrev:

http://cr.openjdk.java.net/~sherman/6995537/webrev

Sherman


---------------------------------1.3.1 OutputStreamWriter-----------------------
    /**
     * Flush the output buffer to the underlying byte stream, without flushing
     * the byte stream itself. This method is non-private only so that it may
     * be invoked by PrintStream.
     */
    void flushBuffer() throws IOException {
        synchronized (lock) {
            ensureOpen();

            for (;;) {
                try {
                    nextByte += ctb.flushAny(bb, nextByte, nBytes);
                }
                catch (ConversionBufferFullException x) {
                    nextByte = ctb.nextByteIndex();
                }
                if (nextByte == 0)
                    break;
                if (nextByte > 0) {
                    out.write(bb, 0, nextByte);
                    nextByte = 0;
                }
            }
        }
    }

    /**
     * Flush the stream.
     *
     * @exception IOException If an I/O error occurs
     */
    public void flush() throws IOException {
        synchronized (lock) {
            flushBuffer();
            out.flush();
        }
    }
