On Mar 6, 2010, at 13:05, Bob Cronin wrote:
> Yes in general I don't have a filetype. The application is an email gateway.
> From the responses so far it seems like heuristics are the only approach. I
> was hoping there might be something more deterministic (although I suspected
> probably not).
>
Alas, certainly not. The best you can hope for is that if your
file contains a byte at a code point that is invalid in some code
pages, you can eliminate those code pages from consideration.
You should also give the user an optional means of specifying
the code page.
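A minimal sketch of that elimination step, in Python. The function name
and the candidate list are made up for illustration; the only assumption
about the codecs is that a strict decode fails on a byte with no
assigned character.

```python
def surviving_code_pages(data: bytes, candidates: list[str]) -> list[str]:
    """Return the candidates under which every byte of data is a valid code point."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue  # at least one invalid code point: rule this page out
        survivors.append(name)
    return survivors


sample = "Hello".encode("cp037")  # stand-in for a file of unknown origin
print(surviving_code_pages(sample, ["ascii", "utf-8", "cp037", "cp500"]))
```

Here ascii and utf-8 are ruled out, but both EBCDIC candidates survive:
Python's cp037 and cp500 codecs assign a character to every byte value,
so this check alone can never separate them. That is why the user
override is still needed.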
What do you do if you know the EBCDIC code page? Translate it
to an ASCII or Unicode page which supports all the characters
in the EBCDIC page?
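If the code page is known, one plausible answer is to go through
Unicode and emit UTF-8, which can represent every character of any
single EBCDIC page. A sketch only; the function name is hypothetical,
and cp037 and cp500 are used because Python ships codecs for them:

```python
def ebcdic_to_utf8(data: bytes, code_page: str = "cp037") -> bytes:
    """Decode EBCDIC bytes with the stated code page, re-encode as UTF-8."""
    text = data.decode(code_page)  # EBCDIC bytes -> Unicode string
    return text.encode("utf-8")    # Unicode string -> UTF-8 bytes


original = "[x]".encode("cp037")
print(ebcdic_to_utf8(original, "cp037"))  # right page: brackets survive
print(ebcdic_to_utf8(original, "cp500"))  # wrong page: brackets come out as other characters
```

The letters and digits agree between these two pages, so a wrong guess
shows up only in characters such as the square brackets.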
(Wandering off-topic) I just performed an experiment to confirm
an ugly suspicion. From an ASCII system, I sent a mail message
which contained the MIME headers:
    Content-Type: text/plain;
        charset=us-ascii
    Content-Transfer-Encoding: quoted-printable
... It arrived at a VM system with those headers transformed to:
    Content-transfer-encoding: 7BIT
    Content-type: text/plain; CHARSET=US-ASCII
Ummm... But it's sitting in my reader as an EBCDIC file. Shouldn't
whatever agent transformed it from us-ascii to EBCDIC have adjusted
the headers to:
    Content-transfer-encoding: 8BIT
    Content-type: text/plain; CHARSET=IBM-1047
or:
    Content-transfer-encoding: 8BIT
    Content-type: text/plain; CHARSET=IBM-37-2
Whatever? Once the transformation is performed, US-ASCII is a
lie, and there's no way EBCDIC fits in 7 bits.
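The 7-bit point is easy to check. A rough sketch, with a hypothetical
helper that looks only at the high bit; it ignores the other conditions
RFC 2045 places on 7bit data (line length, no NUL octets):

```python
def transfer_encoding_label(body: bytes) -> str:
    """Pick '7bit' only if no byte in the body has its high bit set."""
    return "8bit" if any(b > 0x7F for b in body) else "7bit"


print(transfer_encoding_label("Hello".encode("ascii")))  # body as sent
print(transfer_encoding_label("Hello".encode("cp037")))  # same text after translation to EBCDIC
```

EBCDIC letters and digits all sit above 0x7F, so nearly any translated
text body needs the 8bit label.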
I wonder what it would have done to the body and the headers if
the receiving VM system had been in Japan, using EBCDIC 939?
-- gil