Re: [imp] problem with attachments in unicode (UTF16)

Otto Stolz Thu, 27 Mar 2008 05:01:52 -0700

Hello,

Michael M Slusarz had written:

Browser upload information does
not contain the charset of the uploaded data, only the type - all we
have to go by is the charset the browser reports to us via the HTTP
headers.


This morning, I have tried to learn, from RFC 2616, the syntax and
requirements for uploading files via POST; but I couldn’t make head
or tail of it. At least, I have found, in section 3.7.1
<http://rfc-ref.org/RFC-TEXTS/2616/chapter3.html#sub7sub1>, the
requirement:

When no explicit charset parameter is provided by the sender,
media subtypes of the "text" type are defined to have a default
charset value of "ISO-8859-1" when received via HTTP. Data in
character sets other than "ISO-8859-1" or its subsets MUST be
labeled with an appropriate charset value.


Apparently, the former clause says that Imp should tag uploaded
text-type attachments with ‘charset=ISO-8859-1’, if no charset
parameter is given by the browser. Apparently, the latter clause
says that all browsers are buggy, as they do not provide a
charset parameter when uploading text files. Or am I mistaken?
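To illustrate the former clause, here is a minimal Python sketch (not IMP code; the function name effective_charset is made up for this example) that applies the RFC 2616 default only when a text-type part arrives without an explicit charset parameter:

```python
from email.message import Message


def effective_charset(content_type: str):
    """Return the charset to assume for a Content-Type value received via HTTP.

    Hypothetical helper: an explicit charset parameter wins; otherwise the
    RFC 2616, section 3.7.1 default of ISO-8859-1 applies to text/* only.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    explicit = msg.get_content_charset()  # lower-cased, or None if absent
    if explicit:
        return explicit
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"
    return None  # non-text types have no default charset
```

With this reading, an upload labelled just "text/plain" would be tagged ISO-8859-1, whatever its real encoding is, which is exactly why the default alone is not enough.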

So I am still in doubt whether the right way is to lobby for
better, standard-conforming browsers, or to mend Imp to cope with
current browsers’ behaviour (as discussed below), or even both.

I had proposed:

You could inspect the leading two or three bytes of the uploaded
text file:
- If they are EF BB BF, it is almost certainly UTF-8.
- If they are FE FF, it is most probably UTF-16BE.
- If they are FF FE, it is most probably UTF-16LE.


Wikipedia says:

  Although not part of the standard, many Windows programs (including
  Windows Notepad) use the byte sequence EF BB BF at the beginning of a
  file to indicate that the file is encoded using UTF-8.


This information is obsolete. The Unicode Standard 5.0 says otherwise,
cf. Table 2-4 in <http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G19273>,
sub-section ‘Unicode Signature’ in
<http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G9354>,
and <http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf#G25817>.

Note, however, that an attachment commencing with a UTF-16 BOM
should be tagged with ‘charset=UTF-16’. If it is tagged with
‘charset=UTF-16BE’, say, an initial FEFF code unit would be interpreted
as a Zero Width No-Break Space, which would not be visible in the
rendering, but could well impede the automatic processing of the
data.
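The BOM check and the tagging rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the name charset_from_bom is hypothetical, and UTF-32 signatures (which also begin with FF FE in the little-endian case) are deliberately ignored, since they are not discussed here.

```python
import codecs

# (signature bytes, charset label to tag the attachment with).
# Both UTF-16 byte orders map to plain "UTF-16", so that the receiver
# treats the leading FEFF as a signature rather than as text.
_SIGNATURES = (
    (codecs.BOM_UTF8, "UTF-8"),        # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16"),   # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16"),   # FF FE
)


def charset_from_bom(data: bytes):
    """Guess a charset label from the leading bytes; None if no BOM is found."""
    for signature, label in _SIGNATURES:
        if data.startswith(signature):
            return label
    return None
```

The difference between the two labels shows up on decoding: in Python, the "utf-16" codec consumes the signature, whereas "utf-16-le" keeps it as a leading U+FEFF character in the decoded text.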

I had also proposed:

If the uploaded text file does not contain a BOM, you could
take the first entry from the Accept-Charset header as a guess
for the file’s encoding. This is, of course, less reliable,
but would be right for most files from out-of-the-box browser
installations.


Andrew Morgan has commented:

If IMP is not able to use the Byte Order Mark to detect the encoding, then it should assume the file is encoded using the currently selected language/encoding in Horde.


My rationale is that, in a typical Windows system, most text files
will be stored in the system codepage, e. g. CP 1252 in a German
system; likewise, most browsers would be configured to accept the
system codepage as 1st priority (or an almost compatible one, such
as ISO 8859-1, in a German system). So you could take the browser’s
Accept-Charset header as a hint for the prevalent encoding of text
files to be uploaded from that very system.
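A rough Python sketch of this fallback follows. It is an assumption-laden illustration, not IMP code: the function name preferred_charset and the ISO-8859-1 default are my own choices, and instead of blindly taking the first entry it honours the q-values, falling back to list order when they tie.

```python
def preferred_charset(accept_charset: str, default: str = "ISO-8859-1") -> str:
    """Pick the most preferred concrete charset from an Accept-Charset value."""
    candidates = []
    for position, item in enumerate(accept_charset.split(",")):
        name, _, params = item.strip().partition(";")
        name = name.strip()
        if not name or name == "*":
            continue  # the wildcard names no concrete charset
        quality = 1.0  # q defaults to 1 when absent
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not acceptable"
        if quality > 0:
            # Negated quality sorts the highest q first; position breaks ties.
            candidates.append((-quality, position, name))
    if not candidates:
        return default
    return min(candidates)[2]
```

For a header value such as "ISO-8859-1,utf-8;q=0.7,*;q=0.7" this yields ISO-8859-1; it would be used only when the uploaded file carries no BOM.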

I am not familiar, though, with Mac and Linux workstations, so I
cannot exclude that they might deserve a different treatment (which
could be based on the User-Agent header). Any opinions from experts?

In contrast, the language in Horde is selected by the
current user of the system, e. g. a guest in an internet-shop,
or a student at a public workstation in our university. Of course,
they could bring in their text files on memory sticks, but they
also could cut various texts from various sources and paste and
then store them locally, using the system codepage. I am really
not sure what will be the more common case.

And the Horde encoding is selected by the translator of the
language files. For many languages, several different encodings
are possible, and even widely used. Hence, from the language
selected by the Horde user, you cannot reliably infer the pertinent
encoding (let alone the encoding of the files uploaded by him).

I had also proposed:

To be on the safe side, you could add a Charset field to the
Attachments line in the Message Composition form (similar to
the Charset field in the header zone of that form).


Andrew Morgan commented:

This is probably overkill, and would certainly clutter the interface a lot. :)


If the POST info does not (and never will) specify the encoding of
an uploaded text file, this is the only feasible way to comply with
section 4.1.2 of RFC 2046
<http://rfc-ref.org/RFC-TEXTS/2046/chapter4.html#sub1sub2>.

And I think, this will not clutter the interface, cf. attached
screen shots (faked, of course).

I had also written:

That
attachment-charset field would be preset to the value resulting
from the procedure outlined above, but would provide an
opportunity to override the preset value.


On second thought, I am not sure whether this is feasible:
This would amount to reading the file, via JavaScript, immediately
after it has been selected, and before it is uploaded. I have not
delved enough into JavaScript to assess the feasibility of this
approach. But if JavaScript code is used to guess the encoding
of the text file to be uploaded, the second step proposed above
does not apply; rather, the JavaScript code should be able to
find the system codepage, directly.

Best wishes,
  Otto Stolz

<<inline: Imp-attachment-updated.png>>

<<inline: Imp-attachment.png>>

-- 
IMP mailing list - Join the hunt: http://horde.org/bounties/#imp
Frequently Asked Questions: http://horde.org/faq/
To unsubscribe, mail: [EMAIL PROTECTED]
