On Wed, 26 Mar 2008, Otto Stolz wrote:

Hello,

Michael M Slusarz wrote:
No - this is incorrect.  The correct (and unfortunate) answer is that
we cannot detect the charset of a text attachment if it is in a
different charset than the browser.  Browser upload information does
not contain the charset of the uploaded data, only the type - all we
have to go by is the charset the browser reports to us via the HTTP
headers.
...
The greater issue is that PHP provides us no means to determine what
the charset of the given file is.

You could inspect the leading two or three bytes of the uploaded
text file:
- If they are EF BB BF, it is almost certainly UTF-8.
- If they are FE FF, it is most probably UTF-16BE.
- If they are FF FE, it is most probably UTF-16LE.

This would correctly identify every Unicode-encoded text file
uploaded from a Windows system (which still constitutes the
majority of the end-user systems). Of course, this method does
not detect every encoding from every end-user system, but it
would make a great step toward a correct tagging of text type
attachments.
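
The byte check listed above can be sketched in a few lines. This is only an illustration in Python, not IMP code (IMP is written in PHP); the function name sniff_bom and its return shape are hypothetical, chosen for this example.

```python
import codecs

# BOM prefixes from the list above, mapped to Python codec names.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def sniff_bom(data):
    """Return (charset, bom_length), or (None, 0) if no known BOM leads the data.

    Hypothetical helper: only the three byte sequences named in the
    thread are checked, so a file without a BOM is reported as unknown.
    """
    for bom, charset in _BOMS:
        if data.startswith(bom):
            return charset, len(bom)
    return None, 0
```

A caller would read the first three bytes of the upload, pass them to sniff_bom, and fall back to another guess when the result is None. Note that FF FE followed by 00 00 is also the UTF-32LE byte order mark, which this sketch does not distinguish from UTF-16LE.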

This seems like a reasonable method to detect UTF-16 encoded text files. I'm not sure about using it for UTF-8 though. Wikipedia says:

  Although not part of the standard, many Windows programs (including
  Windows Notepad) use the byte sequence EF BB BF at the beginning of a
  file to indicate that the file is encoded using UTF-8. This is the Byte
  Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1
  characters "ï»¿" in most text editors and web browsers not prepared
  to handle UTF-8.

If the uploaded text file does not contain a BOM, you could
take the first entry from the Accept-Charset header as a guess
for the file's encoding. This is, of course, less reliable,
but would be right for most files from out-of-the-box browser
installations.
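
As a rough sketch of that fallback, again in Python for illustration only: the function name first_accept_charset and the default parameter are hypothetical, with default standing in for whatever charset the application would otherwise assume. It takes the first listed entry, as suggested above, and deliberately ignores q-values.

```python
def first_accept_charset(header, default):
    """Return the first charset named in an Accept-Charset value, else default.

    Hypothetical helper: entries are taken in listed order (q-values are
    ignored) and the wildcard "*" is skipped because it names no charset.
    """
    for item in (header or "").split(","):
        # Drop any ";q=..." parameter and normalise the charset name.
        name = item.split(";", 1)[0].strip().lower()
        if name and name != "*":
            return name
    return default
```

For example, first_accept_charset("ISO-8859-1,utf-8;q=0.7,*;q=0.7", "utf-8") yields "iso-8859-1", while an empty or missing header yields the supplied default.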

If IMP is not able to use the Byte Order Mark to detect the encoding, then it should assume the file is encoded using the currently selected language/encoding in Horde.

To be on the safe side, you could add a Charset field to the
Attachments line in the Message Composition form (similar to
the Charset field in the header zone of that form). That
attachment-charset field would be preset to the value resulting
from the procedure outlined above, but would provide an
opportunity to override the preset value.

This is probably overkill, and would certainly clutter the interface a lot. :)

        Andy
-- 
IMP mailing list - Join the hunt: http://horde.org/bounties/#imp
Frequently Asked Questions: http://horde.org/faq/
To unsubscribe, mail: [EMAIL PROTECTED]
