Hello, Michael M Slusarz had written:
Browser upload information does not contain the charset of the uploaded data, only the type - all we have to go by is the charset the browser reports to us via the HTTP headers.
This morning, I have tried to learn, from RFC 2616, the syntax and requirements for uploading files via POST; but I couldn’t make head or tail of it. At least, I have found, in section 3.7.1 <http://rfc-ref.org/RFC-TEXTS/2616/chapter3.html#sub7sub1>, the requirement:
When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.
Apparently, the former clause says that Imp should tag uploaded text-type attachments with ‘charset=ISO-8859-1’, if no charset parameter is given by the browser. Apparently, the latter clause says that all browsers are buggy, as they do not provide a charset parameter when uploading text files. Or am I mistaken? So I am still in doubt, whether the right way is to lobby for better, standard-conforming browsers, or to mend Imp to cope with current browsers’ behaviour (as discussed below), or even both. I had proposed:
You could inspect the leading two or three bytes of the uploaded text file:
- If they are EF BB BF, it is almost certainly UTF-8.
- If they are FE FF, it is most probably UTF-16BE.
- If they are FF FE, it is most probably UTF-16LE.
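To make the three checks above concrete, here is a minimal sketch. It is written in Python purely for illustration (IMP itself is PHP), and the name sniff_bom is made up for this example; it is not anything that exists in IMP or Horde.

```python
import codecs

# Signatures are tried in this order. The 3-byte UTF-8 signature starts
# with EF, so it cannot be mistaken for either 2-byte UTF-16 signature.
_SIGNATURES = (
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),   # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),   # FF FE
)


def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none.

    Caveat (not covered by the proposal above): FF FE followed by 00 00
    is also the UTF-32LE signature; this sketch does not distinguish it.
    """
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

Only the first few bytes of the upload are needed, so the check costs next to nothing; a None result means some other guess has to be made.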
Wikipedia says:
Although not part of the standard, many Windows programs (including Windows Notepad) use the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8.
This information is obsolete. The Unicode Standard 5.0 says otherwise, cf. Table 2-4 in <http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G19273>, sub-section ‘Unicode Signature’ in <http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G9354>, and <http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf#G25817>. Note, however, that an attachment commencing with a BOM should be tagged with ‘charset=UTF-16’. If it is tagged with ‘charset=UTF-16BE’, say, an initial FEFF code-unit would be interpreted as a Zero Width No-Breaking Space, which would not be visible in the rendering, but could well impede the automatic processing of the data. I had also proposed:
If the uploaded text file does not contain a BOM, you could take the first entry from the Accept-Charset header as a guess for the file’s encoding. This is, of course, less reliable, but would be right for most files from out-of-the-box browser installations.
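As an illustration of that fallback, the following sketch picks the first charset named in an Accept-Charset header value. Again it is Python only for the sake of the example, the function name first_accept_charset is hypothetical, and skipping a leading "*" wildcard is my own assumption. It deliberately takes the first listed entry, as proposed, and ignores the q-values.

```python
def first_accept_charset(header_value: str):
    """Return the first concrete charset listed in an Accept-Charset value.

    Returns None if the header is empty or names only the "*" wildcard.
    """
    for entry in header_value.split(","):
        # Drop any parameters such as ";q=0.7" and surrounding blanks.
        name = entry.split(";", 1)[0].strip()
        # Assumption: "*" carries no usable hint, so it is skipped.
        if name and name != "*":
            return name
    return None


# A header of the kind a browser might send:
guess = first_accept_charset("ISO-8859-1,utf-8;q=0.7,*;q=0.7")
```

With the sample header shown, the guess would be ISO-8859-1. This is no more than a hint, of course, which is why an override by the user would still be desirable.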
Andrew Morgan has commented:
If IMP is not able to use the Byte Order Mark to detect the encoding, then it should assume the file is encoded using the currently selected language/encoding in Horde.
My rationale is that, in a typical Windows system, most text files will be stored in the system codepage, e. g. CP 1252 in a German system; likewise, most browsers would be configured to accept the system codepage as first priority (or an almost compatible one, such as ISO 8859-1, in a German system). So you could take the browser’s Accept-Charset header as a hint for the prevalent encoding of text files to be uploaded from that very system. I am not familiar, though, with Mac and Linux workstations, so I cannot exclude that they might deserve a different treatment (which could be based on the User-Agent header). Any opinions from experts?

In contrast, the language in Horde is selected by the current user of the system, e. g. a guest in an internet-shop, or a student at a public workstation in our university. Of course, they could bring in their text files on memory sticks, but they could also cut various texts from various sources, paste them, and then store them locally, using the system codepage. I am really not sure what will be the more common case.

And the Horde encoding is selected by the translator of the language files. For many languages, several different encodings are possible, and even widely used. Hence, from the language selected by the Horde user, you cannot reliably infer the pertinent encoding (let alone the encoding of the files uploaded by him).

I had also proposed:
To be on the safe side, you could add a Charset field to the Attachments line in the Message Composition form (similar to the Charset field in the header zone of that form).
Andrew Morgan commented:
This is probably overkill, and would certainly clutter the interface a lot. :)
If the POST info does not (and never will) specify the encoding of an uploaded text file, this is the only feasible way to comply with section 4.1.2 of RFC 2046 <http://rfc-ref.org/RFC-TEXTS/2046/chapter4.html#sub1sub2>. And I think this will not clutter the interface, cf. attached screen shots (faked, of course). I had also written:
That attachment-charset field would be preset to the value resulting from the procedure outlined above, but would provide an opportunity to override the preset value.
On second thought, I am not sure whether this is feasible: it would amount to reading the file, via JavaScript, immediately after it has been selected, and before it is uploaded. I have not delved enough into JavaScript to assess the feasibility of this approach. But if JavaScript code is used to guess the encoding of the text file to be uploaded, the second step proposed above does not apply; rather, the JavaScript code should be able to find the system codepage directly.

Best wishes,
Otto Stolz
<<inline: Imp-attachment-updated.png>>
<<inline: Imp-attachment.png>>
