On Tue, Feb 10, 2009 at 8:28 PM, Oleg Kalnichevski <[email protected]> wrote: > Markus Wiederkehr wrote: >> >> I've been investigating the current code a little bit more and I've >> come to think that something really goes wrong. Please have a look at >> the code. >> >> Class AbstractEntity has a ByteArrayBuffer "linebuf" and a >> CharArrayBuffer "fieldbuf". Method fillFieldBuffer() copies bytes from >> the underlying stream to "linebuf" (line 146: >> instream.readLine(linebuf)). Later on the input bytes are appended to >> "fieldbuf" (line 143: fieldbuf.append(linebuf, 0, len)). At this point >> bytes are decoded into characters. A closer look at CharArrayBuffer >> reveals how this is done: >> >> int ch = b[i1]; >> if (ch < 0) { >> ch = 256 + ch; >> } >> >> This is equivalent to ISO-8859-1 conversion because Latin 1 is the >> only charset that directly maps byte codes 00 to ff to unicode code >> points 0000 to 00ff. >> >> All works well as long as the underlying stream only contains ASCII bytes. >> >> But assume the message contains non-ASCII bytes and a Content-Type >> field with a charset parameter is also present. In this case the input >> bytes should probably be decoded using that specified charset instead >> of Latin 1. This is the opposite situation to the LENIENT writing mode >> where we encode header fields using the charset from the Content-Type >> field. >> > > To me, parsing of MIME headers using any charset other that US-ASCII never > made any sense of what so ever, but so be it.
Interesting.. I always thought that the lenient writing mode is a requirement coming from the http world or HC more specifically. And if someone wants to write non-ascii chars using the content-type charset it seems only natural that a corresponding parsing mode is also required. > So, in the lenient mode, effectively, we would have to do the following: (1) > parse headers (at least partially) in order to locate Content-Type header > and extract the charset attribute from it, if present; (2) parse all headers > again (probably, lazily) using the charset from the Content-Type. Exactly. > That's quite a bit of extra work. Well, actually not very much, at least when compared to the current code. Currently header fields are already parsed twice, by AbstractEntity and a second time by MessageBuilder. MessageBuilder would only have to keep the raw fields in a list and defer the Field creation until endHeader() is invoked.. >> Okay, so now assume we have parsed that message and use the LENIENT >> writing mode to write it out again. Clearly we have a serious round >> tripping issue now, because Latin 1 was used to decode the fields but >> the potentially different Content-Type charset is used to encode them >> again. >> >> I think the inherent problem is that AbstractEntity attempts to >> convert bytes into characters. This should not happen so early in the >> process. >> >> In my opinion it would be better if AbstractEntity treated a header >> field as a byte array. It would be better to pass a byte array to a >> ContentHandler or a BodyDescriptor. The ContentHandler / >> BodyDescriptor implementation can then decide how to decode the bytes. >> > > This would push the responsibility of detecting the charset and correct > parsing of headers to individual ContentHandler implementations and would > make the task of implementing a ContentHandler more complex, but probably is > the most flexible solution to the problem. A ContentHandler receives everything else as a sequence of bytes; the message body as well as preamble/epilogue of a multipart. Only the headers come as string. >> This could really help with the goal of complete round tripping.. >> Class Field could store the original raw field value in a byte array >> instead a String. >> >> One drawback would be that duplicate parsing of header fields is maybe >> inevitable.. >> >> Opinions? >> > > I am in favor of using ByteArrayBuffer at the ContentHandler level, even > though this would make the task of implementing it more difficult. > > Oleg > >> Markus >> >> PS: I don't indent to stop 0.6 but maybe we should keep the note >> regarding round trip issues in the release notes.
