Re: Round tripping (MIME4J-112)

Stefano Bagnara Tue, 10 Feb 2009 12:32:52 -0800

Oleg Kalnichevski wrote:

Markus Wiederkehr wrote:

I've been investigating the current code a little bit more and I've
come to think that something really goes wrong. Please have a look at
the code.


Class AbstractEntity has a ByteArrayBuffer "linebuf" and a
CharArrayBuffer "fieldbuf". Method fillFieldBuffer() copies bytes from
the underlying stream to "linebuf" (line 146:
instream.readLine(linebuf)). Later on the input bytes are appended to
"fieldbuf" (line 143: fieldbuf.append(linebuf, 0, len)). At this point
bytes are decoded into characters. A closer look at CharArrayBuffer
reveals how this is done:

    int ch = b[i1];
    if (ch < 0) {
        ch = 256 + ch;
    }

This is equivalent to ISO-8859-1 conversion because Latin 1 is the
only charset that directly maps byte codes 00 to ff to Unicode code
points 0000 to 00ff.
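
As a quick check of the equivalence described above, here is a minimal
sketch that compares the same widening loop against the JDK's
ISO-8859-1 decoder for all 256 byte values. The class and method names
(Latin1Equivalence, widen) are made up for illustration and are not
part of mime4j.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Equivalence {

    // Same widening as the quoted CharArrayBuffer snippet:
    // a negative (signed) byte is shifted into the range 128..255.
    static String widen(byte[] b) {
        StringBuilder sb = new StringBuilder(b.length);
        for (byte value : b) {
            int ch = value;
            if (ch < 0) {
                ch = 256 + ch;
            }
            sb.append((char) ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xff.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String viaLoop = widen(all);
        String viaCharset = new String(all, StandardCharsets.ISO_8859_1);
        System.out.println(viaLoop.equals(viaCharset)); // prints "true"
    }
}
```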

All works well as long as the underlying stream only contains ASCII bytes.


But assume the message contains non-ASCII bytes and a Content-Type
field with a charset parameter is also present. In this case the input
bytes should probably be decoded using that specified charset instead
of Latin 1. This is the opposite situation to the LENIENT writing mode
where we encode header fields using the charset from the Content-Type
field.

To me, parsing of MIME headers using any charset other than US-ASCII
never made any sense whatsoever, but so be it.

So, in the lenient mode, effectively, we would have to do the
following: (1) parse headers (at least partially) in order to locate
the Content-Type header and extract the charset attribute from it, if
present; (2) parse all headers again (probably, lazily) using the
charset from the Content-Type.
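
The two passes could be sketched roughly as follows. This is only an
illustration under simplifying assumptions, not mime4j code: the class
TwoPassHeaderDecode and its methods are hypothetical, the charset
parameter is found with a plain regex, folded Content-Type fields are
not handled, and Latin 1 is assumed as the fallback.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoPassHeaderDecode {

    // Assumption: the charset parameter sits on the same physical line
    // as "Content-Type:"; a folded header would need unfolding first.
    private static final Pattern CHARSET = Pattern.compile(
            "(?im)^Content-Type:.*?charset=\"?([A-Za-z0-9._-]+)\"?");

    // Pass 1: read the raw header as Latin 1 (lossless for any byte)
    // just to locate the charset parameter.
    static Charset detectCharset(byte[] rawHeader) {
        String latin1 = new String(rawHeader, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(latin1);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: use the fallback.
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Pass 2: decode the whole header again with the detected charset.
    static String decodeHeader(byte[] rawHeader) {
        return new String(rawHeader, detectCharset(rawHeader));
    }

    public static void main(String[] args) {
        byte[] raw = ("Subject: caf\u00e9\r\n"
                + "Content-Type: text/plain; charset=UTF-8\r\n")
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detectCharset(raw));
        System.out.print(decodeHeader(raw));
    }
}
```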


That's quite a bit of extra work.

If we want to parse real world messages then we have to expect also
non-7bit-ASCII bytes in headers. They are malformed, but they are very
common.

IMHO this is a non-issue for "subsequent-roundtripping": mime4j should
encode them properly in output and be able to roundtrip its own output.

If instead we want to be able to parse any stream (non valid MIME, or
even NON MIME at all.. maybe any binary content???) then we'll have to
deal with this and much more: IMHO this would be cool, but a real
PITA. I'd be very happy with the "subsequent-roundtripping" (I'm not
sure this is the same as Robert's "unlimited round tripping"; to be sure
I created a new term ;-) ).

Okay, so now assume we have parsed that message and use the LENIENT
writing mode to write it out again. Clearly we have a serious round
tripping issue now, because Latin 1 was used to decode the fields but
the potentially different Content-Type charset is used to encode them
again.
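
To make the mismatch concrete, here is a small sketch that does not use
mime4j at all. It assumes a header whose raw bytes are UTF-8, decodes
them as Latin 1 and encodes them again with either charset. The class
RoundTripMismatch is purely illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripMismatch {

    // Decode raw bytes with one charset, then encode with another.
    static byte[] decodeThenEncode(byte[] raw, Charset decodeWith,
                                   Charset encodeWith) {
        String decoded = new String(raw, decodeWith);
        return decoded.getBytes(encodeWith);
    }

    public static void main(String[] args) {
        // Assumed input: an 8-bit header field encoded as UTF-8.
        byte[] raw = "Subject: caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Parsed as Latin 1, written with the Content-Type charset.
        byte[] mixed = decodeThenEncode(raw,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        // Parsed and written with the same single-byte charset.
        byte[] same = decodeThenEncode(raw,
                StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(raw, mixed)); // prints "false"
        System.out.println(Arrays.equals(raw, same));  // prints "true"
    }
}
```

In the mixed case the two bytes of the UTF-8 "é" are each re-encoded as
a two-byte sequence, so the output is longer than the input.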

I think the inherent problem is that AbstractEntity attempts to
convert bytes into characters. This should not happen so early in the
process.

In my opinion it would be better if AbstractEntity treated a header
field as a byte array. It would be better to pass a byte array to a
ContentHandler or a BodyDescriptor. The ContentHandler /
BodyDescriptor implementation can then decide how to decode the bytes.

This would push the responsibility of detecting the charset and correct
parsing of headers to individual ContentHandler implementations and
would make the task of implementing a ContentHandler more complex, but
probably is the most flexible solution to the problem.

I'm not sure I understand the technical details, but IMHO the "smart"
thing is to correctly decode 8bit bytes from headers using the encoding
specified in the same header (maybe in a following header line!!!), while
in output always use encoding (so no 8bit in output from mime4j, ever).
This will fix broken messages; I'm not sure how many PGP/DKIM/SMIME-like
normalizations this would break....

This could really help with the goal of complete round tripping..
Class Field could store the original raw field value in a byte array
instead of a String.
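
A rough sketch of what such a field could look like, assuming decoding
is deferred until the caller knows the charset. RawField and its
methods are hypothetical names for illustration, not the existing
mime4j Field API.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RawField {

    // The field exactly as it appeared in the input stream.
    private final byte[] raw;

    RawField(byte[] raw) {
        this.raw = raw.clone();
    }

    // Defensive copy of the original bytes.
    byte[] getRaw() {
        return raw.clone();
    }

    // Decoding happens only on request, with a charset chosen by the
    // caller (for example the one from Content-Type).
    String decode(Charset charset) {
        return new String(raw, charset);
    }

    // Writing back the untouched bytes keeps the round trip exact.
    void writeTo(OutputStream out) throws IOException {
        out.write(raw);
    }

    public static void main(String[] args) throws IOException {
        byte[] input = "Subject: caf\u00e9".getBytes(StandardCharsets.UTF_8);
        RawField field = new RawField(input);
        System.out.println(field.decode(StandardCharsets.UTF_8).length());
        field.writeTo(System.out);
    }
}
```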

One drawback would be that duplicate parsing of header fields is maybe
inevitable..

Opinions?

I am in favor of using ByteArrayBuffer at the ContentHandler level, even
though this would make the task of implementing it more difficult.


Oleg

Markus

PS: I don't intend to stop 0.6 but maybe we should keep the note
regarding round trip issues in the release notes.


+1

Stefano
