I tend to agree with Sherman that the real problem is the
OutputStreamWriter API, which isn't good enough to handle various
encodings. My understanding is that the charset API was introduced in
1.4 to deal with the limitations of java.io and other
encoding-handling issues.
I still don't think it's correct to change the flush semantics. The
flush method should just flush out any outstanding data to the given
output stream as defined in java.io.Writer. What if Writer.write(int) is
used to write UTF-16 data with any stateful encoding? Suppose the stateful
encoding can support supplementary characters which require other G0
designations, does the following calling sequence work?
    writer.write(highSurrogate);
    writer.flush();
    writer.write(lowSurrogate);
    writer.flush();
Of course, this isn't a problem with iso-2022-jp, though.
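
The calling sequence above can be tried with a small harness. The sketch
below is only an illustration: it uses UTF-8 instead of a stateful
encoding (an assumption, so that it does not depend on optional
charsets), and the class name, the helper names and the choice of
U+1F600 are made up for the example. It writes the two halves of a
surrogate pair with and without a flush() in between so the outputs can
be compared. If flush() were to finalize the encoder, the lone high
surrogate would have to be treated as malformed input at that point.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class SurrogateFlushDemo {
    // U+1F600 as a UTF-16 surrogate pair (illustrative choice of character).
    static final char HIGH = '\uD83D';
    static final char LOW = '\uDE00';

    /** Writes the pair one char at a time, optionally flushing between the halves. */
    static byte[] writePair(boolean flushBetween) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try {
            // UTF-8 is an assumption of this sketch, not the encoding under discussion.
            Writer writer = new OutputStreamWriter(bos, StandardCharsets.UTF_8);
            writer.write(HIGH);
            if (flushBetween) {
                writer.flush();
            }
            writer.write(LOW);
            writer.flush();
            writer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Formats bytes as space-separated lowercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("flush between halves: " + hex(writePair(true)));
        System.out.println("no flush in between:  " + hex(writePair(false)));
    }
}
```

If the two lines differ on a given runtime, the flush() in the middle
has changed how the pair was encoded.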
I think it's a correct fix, not a workaround, to create a filter stream
to deal with stateful encodings with the java.io API. If it's OK to
support only 1.4 and later, the java.nio.charset API should be used.
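
As a rough sketch of the java.nio.charset route, the one-shot
CharsetEncoder.encode(CharBuffer) method performs a complete encoding
operation (reset, encode, flush), so the bytes can be written to a
stream that stays open. The class name and the writeEncoded helper are
hypothetical, and the sketch assumes the runtime provides ISO-2022-JP.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class EncodeWithoutWriter {
    /** Hypothetical helper: encodes text completely and writes it, leaving 'out' open. */
    static void writeEncoded(OutputStream out, String text, Charset charset) {
        try {
            // encode(CharBuffer) resets the encoder, encodes all input and flushes
            // the encoder, so any closing escape sequence is part of the result.
            ByteBuffer encoded = charset.newEncoder().encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[encoded.remaining()];
            encoded.get(bytes);
            out.write(bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one Japanese character, then plain ASCII bytes directly to the same stream. */
    static byte[] demo() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        writeEncoded(bos, "\u6700", Charset.forName("ISO-2022-JP"));
        bos.write('O'); // the underlying stream is still open and usable
        bos.write('K');
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : demo()) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```

The trade-off is that the caller manages the encoder itself instead of
relying on a Writer, which is only an option when 1.4 or later can be
required.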
Thanks,
Masayoshi
On 2/10/2012 4:12 AM, Xueming Shen wrote:
CCed Bill Shannon.
On 02/09/2012 11:10 AM, Xueming Shen wrote:
CharsetEncoder has the "flush()" method as the last step (of a series
of "encoding" steps) to flush out any internal state to the output
buffer. The issue here is that the upper-level wrapper class,
OutputStreamWriter in our case, doesn't provide an "explicit" mechanism
to let the user request a "flush" on the underlying encoder. The only
"guaranteed" mechanism is the "close()" method, which does not appear
appropriate to invoke in some use scenarios, such as the
JavaMail.writeTo() case.
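
The encoder-level steps described above can be sketched directly
against CharsetEncoder. This is a minimal illustration, not the
StreamEncoder code: the class and method names are hypothetical, the
64-byte buffer is just large enough for the demo input, and ISO-2022-JP
is assumed to be available in the runtime.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.util.Arrays;

class EncoderFlushDemo {
    /** Encodes text in one pass; the final encoder flush() step is optional. */
    static byte[] encode(String text, String charsetName, boolean flushEncoder) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        ByteBuffer out = ByteBuffer.allocate(64); // ample for the short demo input
        // endOfInput = true: no further characters will follow.
        CoderResult result = encoder.encode(CharBuffer.wrap(text), out, true);
        if (!result.isUnderflow()) {
            throw new IllegalStateException("encode failed: " + result);
        }
        if (flushEncoder) {
            // Last step: let the encoder write out any pending internal state,
            // such as a closing escape sequence.
            result = encoder.flush(out);
            if (!result.isUnderflow()) {
                throw new IllegalStateException("flush failed: " + result);
            }
        }
        return Arrays.copyOf(out.array(), out.position());
    }

    public static void main(String[] args) {
        System.out.println("without flush: "
                + encode("\u6700", "ISO-2022-JP", false).length + " bytes");
        System.out.println("with flush:    "
                + encode("\u6700", "ISO-2022-JP", true).length + " bytes");
    }
}
```

OutputStreamWriter exposes no call that maps onto the flushEncoder step
short of close(), which is the gap being discussed.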
It appears we lack a "close this stream, but not the underlying
stream" mechanism in our layered/chained streams. I have similar
requests for this kind of mechanism in other areas, such as the
zip/gzip streams: an app wraps an "outputstream" with zip/gzip and
wants to release the zip/gzip layer after use (to release the native
resource, for example) but keep the underlying stream unclosed. The
only workaround now is to wrap the underlying stream with a subclass
that overrides the "close()" method, which is really not desirable.
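
For reference, the workaround mentioned above might look roughly like
the following. NonClosingOutputStream and the other names are
hypothetical, not JDK classes, and the sketch assumes ISO-2022-JP is
available. Closing the writer lets the encoder finish its output, while
the wrapper swallows the close() so the underlying stream stays usable.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class CloseShieldDemo {
    /** Hypothetical wrapper: close() flushes but leaves the wrapped stream open. */
    static final class NonClosingOutputStream extends FilterOutputStream {
        NonClosingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            out.write(b, off, len); // avoid FilterOutputStream's byte-at-a-time default
        }

        @Override
        public void close() throws IOException {
            flush(); // deliberately does not call out.close()
        }
    }

    /** Test double that records whether close() reached the underlying stream. */
    static final class CloseTrackingStream extends ByteArrayOutputStream {
        boolean closed;

        @Override
        public void close() {
            closed = true;
        }
    }

    static byte[] demo() {
        CloseTrackingStream underlying = new CloseTrackingStream();
        try {
            Writer writer = new OutputStreamWriter(
                    new NonClosingOutputStream(underlying), "ISO-2022-JP");
            writer.write("\u6700");
            writer.close(); // finishes the encoder; the wrapper absorbs the close
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (underlying.closed) {
            throw new IllegalStateException("underlying stream was closed");
        }
        underlying.write('!'); // the underlying stream is still usable
        return underlying.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : demo()) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println();
    }
}
```

Every caller that needs this has to carry such a wrapper around, which
is why it is not a desirable long-term answer.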
OutputStreamWriter.flush() does not explicitly specify in its API doc
whether it should actually flush the underlying charset encoder (so I
would not argue strongly that this IS an SE bug), but given that it
flushes its buffer (internal status) and then the underlying "out"
stream, it's reasonable to consider that the "internal status" of its
encoder also needs to be flushed, especially since this was the
behavior for releases earlier than 1.4.2.
As I said, I have hesitated to "fix" this problem for a while (one,
it has been here for 3 major releases; two, the API does not
explicitly say so), but as long as we don't have a reasonable
"close-ME-only" mechanism for those layered streams, this appears to
be a reasonable approach to solve the problem, without any obvious
negative impact.
-Sherman
PS: There is another implementation "detail": the original
iso-2022-jp c2b converter actually restores the state back to ASCII
mode at the end of its "convert" method. This makes the analysis a
little complicated, but should not change the issue we are
discussing.
On 02/09/2012 12:26 AM, Masayoshi Okutsu wrote:
First of all, is this really a Java SE bug? The usage of
OutputStreamWriter in JavaMail seems to be wrong to me. The writeTo
method in the bug report doesn't seem to be able to deal with any
stateful encodings.
Masayoshi
On 2/9/2012 3:26 PM, Xueming Shen wrote:
Hi
This is a long-standing "regression" from 1.3.1 in how
OutputStreamWriter.flush()/flushBuffer() handles the escape or shift
sequences of some charsets/encodings, for example ISO-2022-JP.
ISO-2022-JP is an encoding that starts in ASCII mode and then
switches between ASCII and Japanese characters through escape
sequences. For example, the escape sequence ESC $ B (0x1b 0x24 0x42)
indicates that the following bytes are Japanese (a switch from ASCII
mode to Japanese mode), and ESC ( B (0x1b 0x28 0x42) switches back to
ASCII.
Java's sun.io.CharToByteConverter (the old-generation charset
converter) and java.nio.charset.CharsetEncoder usually switch back
and forth between ASCII and Japanese modes based on the input
character sequence. For example, if you are in ASCII mode and your
next input character is Japanese, you add ESC $ B to the output first,
followed by the converted input character; if you are in Japanese
mode and your next input is ASCII, you output ESC ( B first to switch
the mode, and then the ASCII character. They switch back to ASCII
mode (if the last mode is not ASCII) when either the encoding is
ending or the flush() method gets invoked.
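
The mode switching described above can be observed with a complete,
one-shot encoding such as String.getBytes(). This is only an
illustrative sketch: the class and method names are hypothetical, and
it assumes the runtime provides the ISO-2022-JP charset.

```java
import java.nio.charset.Charset;

class Iso2022JpModes {
    /** Encodes text as ISO-2022-JP and formats the bytes as space-separated hex. */
    static String hexOf(String text) {
        byte[] bytes = text.getBytes(Charset.forName("ISO-2022-JP"));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII 'A', then ESC $ B + 3a 47 for \u6700, then ESC ( B before ASCII 'B'.
        System.out.println(hexOf("A\u6700B")); // 41 1b 24 42 3a 47 1b 28 42 42
        // Input ending in Japanese mode: the encoding ends with ESC ( B.
        System.out.println(hexOf("A\u6700"));  // 41 1b 24 42 3a 47 1b 28 42
    }
}
```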
In JDK 1.3.1, OutputStreamWriter.flushBuffer() explicitly invokes
sun.io.c2b's flushAny() to switch back to ASCII mode every time
flush() or flushBuffer() (from PrintStream) gets invoked, as shown at
the end of this email. For example, the code below uses
OutputStreamWriter to "write" the Japanese character \u6700 to the
underlying stream with iso-2022-jp:
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    String str = "\u6700";
    OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");
    osw.write(str, 0, str.length());
Since iso-2022-jp starts in ASCII mode and we now have a Japanese
character, we need to switch into Japanese mode first (the first 3
bytes) and then write the encoded Japanese character (the following 2
bytes):
    0x1b 0x24 0x42 0x3a 0x47
and then the code invokes
    osw.flush();
Since we are now in Japanese mode, the writer continues to write out
    0x1b 0x28 0x42
to switch back to ASCII mode. The total output is 8 bytes after
write() and flush().
However, when all encoding/charset-related code was migrated from
1.3.1's sun.io-based implementation to 1.4's java.nio.charset-based
implementation (over 1.4, 1.4.1 and 1.4.2, we gradually migrated from
sun.io to java.nio.charset), the "c2b.flushAny()" invocation was
evidently dropped in sun.nio.cs.StreamEncoder. As a result, the
"switch back to ASCII mode" sequence is no longer output when
OutputStreamWriter.flush() or PrintStream.write(String) is invoked.
This does not cause a problem in most use scenarios, where the
"stream" finally gets closed (in which case the StreamEncoder does
invoke the encoder's flush() to output the escape sequence that
switches back to ASCII) or PrintStream.println(String) is used (it
outputs a \n character, and since \n is in the ASCII range, it
"accidentally" switches the mode back to ASCII).
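
To check which behavior a given JDK has, the byte counts after flush()
and after close() can be compared. Going by the description above, a
release with the regression would be expected to report 5 bytes after
flush() and 8 after close(), while 1.3.1 would already report 8 after
flush(); the sketch itself does not assume either. The class and method
names are hypothetical, and ISO-2022-JP is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;

class FlushVersusClose {
    /** Returns { byte count after write() + flush(), byte count after close() }. */
    static int[] byteCounts() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try {
            OutputStreamWriter osw = new OutputStreamWriter(bos, "iso-2022-jp");
            String str = "\u6700";
            osw.write(str, 0, str.length());
            osw.flush();
            int afterFlush = bos.size(); // depends on whether flush() reaches the encoder
            osw.close();                 // close() does flush the encoder
            return new int[] { afterFlush, bos.size() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = byteCounts();
        System.out.println("after flush(): " + counts[0] + " bytes");
        System.out.println("after close(): " + counts[1] + " bytes");
    }
}
```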
But it obviously causes a problem when you cannot close the
OutputStreamWriter after you're done with your iso-2022-jp writing
(for example, you need to continue to use the underlying OutputStream
for other writing, but not "this" osw). On 1.3.1, these apps invoke
osw.flush() to force the output to switch back to ASCII; this no
longer works now that we have switched to java.nio.charset in JDK
1.4.2 (we migrated iso-2022-jp to nio.charset in 1.4.2). This is what
happened in JavaMail, as described in the bug report.
The solution is to restore the "flush the encoder" mechanism in
StreamEncoder's flushBuffer(). I have hesitated to make this change
for a while, mostly because this regressed behavior has been there for
3 releases, and the change triggers yet another "behavior change". But
given that there is no obvious workaround and it only changes the
behavior of the charsets with this shift in/out mechanism, mainly the
iso-2022 family and those IBM EBCDIC_DBCS charsets, I decided to give
it a try.
Here is the webrev:
http://cr.openjdk.java.net/~sherman/6995537/webrev
Sherman
---------------------------------1.3.1 OutputStreamWriter-----------------------
    /**
     * Flush the output buffer to the underlying byte stream, without flushing
     * the byte stream itself. This method is non-private only so that it may
     * be invoked by PrintStream.
     */
    void flushBuffer() throws IOException {
        synchronized (lock) {
            ensureOpen();
            for (;;) {
                try {
                    nextByte += ctb.flushAny(bb, nextByte, nBytes);
                }
                catch (ConversionBufferFullException x) {
                    nextByte = ctb.nextByteIndex();
                }
                if (nextByte == 0)
                    break;
                if (nextByte > 0) {
                    out.write(bb, 0, nextByte);
                    nextByte = 0;
                }
            }
        }
    }

    /**
     * Flush the stream.
     *
     * @exception IOException If an I/O error occurs
     */
    public void flush() throws IOException {
        synchronized (lock) {
            flushBuffer();
            out.flush();
        }
    }