Hello,

Michael M Slusarz wrote:
> No - this is incorrect. The correct (and unfortunate) answer is that
> we can not detect the charset of a text attachment if it is in a
> different charset than the browser. Browser upload information does
> not contain the charset of the uploaded data, only the type - all we
> have to go by is the charset the browser reports to us via the HTTP
> headers.
...
> The greater issue is that PHP provides us no means to determine what
> the charset of the given file is.

You could inspect the leading two or three bytes of the uploaded text file:
- If they are EF BB BF, it is almost certainly UTF-8.
- If they are FE FF, it is most probably UTF-16BE.
- If they are FF FE, it is most probably UTF-16LE.

This would correctly identify nearly every Unicode-encoded text file
uploaded from a Windows system (which still constitutes the majority of
end-user systems). Of course, this method does not detect every encoding
from every end-user system, but it would be a big step toward correct
tagging of text-type attachments.

If the uploaded text file does not contain a BOM, you could take the
first entry from the Accept-Charset header as a guess for the file’s
encoding. This is, of course, less reliable, but would be right for most
files from out-of-the-box browser installations.

To be on the safe side, you could add a Charset field to the Attachments
line in the Message Composition form (similar to the Charset field in
the header zone of that form). That attachment-charset field would be
preset to the value resulting from the procedure outlined above, but
would give the user an opportunity to override the preset value.

> There is nothing wrong
> with the way we Q-P - but if we Q-P using the wrong charset, the data
> is going to be invalid.

To avoid a possible misunderstanding of this wording: the Q-P encoding
poses no problem, even if sailing under false colours, charset-wise.
Q-P simply encodes the byte 3D, and all bytes above 7F, as their
hexadecimal values, which will be decoded without any problem. When
tagged as UTF-8, as in the examples discussed so far, even the byte
order is sure to be preserved.

The only problem is the wrong Charset tag, as it will cause particular
byte values (or sequences thereof) to be considered illegal and, in due
course, to be replaced with Replacement Characters (or, perhaps, even
dropped).
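
To make the BOM check, the Accept-Charset fallback and the Q-P point
above concrete, here is a minimal sketch. It is written in Python only
for brevity (IMP itself is PHP), and it is an illustration, not a
proposed patch. The function names and the ISO-8859-1 default are my
own assumptions. Note also that FF FE is the start of the UTF-32LE BOM
as well, which this sketch does not try to tell apart.

```python
import codecs
import quopri

# The three BOM signatures named above (hypothetical table name).
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),   # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),   # FF FE
)


def first_accept_charset(header):
    """Return the first charset listed in an Accept-Charset value, or None."""
    if not header:
        return None
    # Take the first comma-separated entry and drop any ";q=..." parameter.
    first = header.split(",")[0].split(";")[0].strip()
    if not first or first == "*":
        return None
    return first


def guess_charset(data, accept_charset=None, default="ISO-8859-1"):
    """Guess the charset of uploaded bytes: BOM first, then Accept-Charset.

    The default is an assumption of this sketch, used when neither
    source yields a value.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return first_accept_charset(accept_charset) or default


def qp_round_trip(data):
    """Q-P encode and decode again; the bytes come back unchanged."""
    return quopri.decodestring(quopri.encodestring(data))


if __name__ == "__main__":
    print(guess_charset(b"\xef\xbb\xbfHello"))
    print(guess_charset(b"Hello", "ISO-8859-15, utf-8;q=0.7, *;q=0.3"))

    # ISO-8859-1 bytes survive Q-P intact, whatever the charset tag says ...
    latin1 = b"Gr\xfc\xdfe"
    assert qp_round_trip(latin1) == latin1
    # ... but reading them under a wrong UTF-8 tag yields U+FFFD.
    print(latin1.decode("utf-8", errors="replace"))
```

The value returned by guess_charset would only be the preset for the
attachment-charset field, so the user could still override it.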

Best wishes,
  Otto Stolz
