Alrighty then, I guess we'll stick with the node-based default. Nobody has complained (much) about that for 7-8 years. We do provide a way for users to override the default: either spool the punch with distcode cp:nnnn, or add a hokey optional RFC822-style comment to the subject line in the form "(cp:nnnn mmmm)", where nnnn is the source EBCDIC codepage and mmmm is the desired ASCII codepage.

For a given EBCDIC codepage we choose what we believe to be the best match, ASCII-wise (e.g. if we get EBCDIC 939, we emit ASCII 943, aka x-sjis). We do the same in the other direction (i.e. we choose the EBCDIC codepage we believe best matches the input ASCII character set). We've got support for over 50 different ASCII and 50 different EBCDIC codepages (single and double byte).

We derive the SBCS tables on the fly by converting from source to Unicode and then to target (although once we've derived a particular from/to table, we cache it so that we don't have to re-derive it until the next recycle). We don't cache DBCS tables (but we do have plugins for several of the more heavily used ones to avoid the overhead of the trip to/from Unicode for every email).
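For what it's worth, the derive-and-cache step for single-byte tables can be sketched roughly like this. It is only an illustration in Python using its bundled codecs (cp037 and latin_1 stand in for whichever pair is needed). The names derive_table and translate, and the "?" substitution byte, are assumptions made for the sketch, not how the gateway is actually written.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def derive_table(src: str, dst: str, sub: bytes = b"?") -> bytes:
    """Build a 256-byte SBCS table by going src -> Unicode -> dst.

    The result is cached per (src, dst, sub), so each table is derived
    at most once for the life of the process.
    """
    table = bytearray(256)
    for b in range(256):
        try:
            ch = bytes([b]).decode(src)   # source byte -> Unicode
            out = ch.encode(dst)          # Unicode -> target byte(s)
        except UnicodeError:
            out = sub                     # unassigned or unmappable
        # Anything that does not map to exactly one byte gets the
        # substitution byte (hypothetical policy for this sketch).
        table[b] = out[0] if len(out) == 1 else sub[0]
    return bytes(table)


def translate(data: bytes, src: str, dst: str) -> bytes:
    """Translate a single-byte-encoded buffer using the cached table."""
    return data.translate(derive_table(src, dst))
```

With this, translate(b"\xc8\x85\x93\x93\x96", "cp037", "latin_1") yields b"Hello", and a second call for the same codepage pair reuses the cached table instead of rebuilding it.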
I was just hoping to find some clever way to choose the correct source EBCDIC table without having to assume a default (or be told what it is). As to your experiment, VM SMTP is quite limited in this regard (which is why we need this gateway, which interfaces with VM SMTP via the MAILER configuration keyword).

-- bc

On Sat, Mar 6, 2010 at 3:48 PM, Paul Gilmartin <[email protected]> wrote:

> On Mar 6, 2010, at 13:05, Bob Cronin wrote:
>
> > Yes in general I don't have a filetype. The application is an email gateway.
> > From the responses so far it seems like heuristics are the only approach. I
> > was hoping there might be something more deterministic (although I suspected
> > probably not).
> >
> Alas, certainly not. The best you can hope for is that if your
> file contains a character at a code point invalid in some code
> pages, you can eliminate those code pages from consideration.
>
> You should provide a means for the user to specify a code page,
> optionally.
>
> What do you do if you know the EBCDIC code page? Translate it
> to an ASCII or Unicode page which supports all the characters
> in the EBCDIC page?
>
> (Wandering off-topic) I just performed an experiment to confirm
> an ugly suspicion. From an ASCII system, I sent a mail message
> which contained the MIME headers:
>
>     Content-Type: text/plain;
>         charset=us-ascii
>     Content-Transfer-Encoding: quoted-printable
>
> ... It arrived at a VM system with those headers transformed to:
>
>     Content-transfer-encoding: 7BIT
>     Content-type: text/plain; CHARSET=US-ASCII
>
> Ummm... But it's sitting in my reader as an EBCDIC file. Shouldn't
> whatever agent transformed it from us-ascii to EBCDIC have adjusted
> the headers to:
>
>     Content-transfer-encoding: 8BIT
>     Content-type: text/plain; CHARSET=IBM-1047
>
> or:
>
>     Content-transfer-encoding: 8BIT
>     Content-type: text/plain; CHARSET=IBM-37-2
>
> Whatever? Once the transformation is performed, US-ASCII is a
> lie, and there's no way EBCDIC fits in 7 bits.
>
> I wonder what it would have done to the body and the headers if
> the receiving VM system had been in Japan, using EBCDIC 939?
>
> -- gil
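The elimination idea quoted above (rule out any code page in which the data hits an unassigned code point) can be sketched as follows. This is a hedged illustration only: the function name surviving_codepages is made up, and it relies on Python's bundled codecs rather than anything in the gateway. It narrows the field but cannot pick a winner among pages that assign every byte value.

```python
from typing import Iterable, List


def surviving_codepages(data: bytes, candidates: Iterable[str]) -> List[str]:
    """Return the candidate code pages under which `data` decodes cleanly.

    A strict decode fails as soon as a byte lands on a code point the
    page leaves unassigned, which is the only hard evidence available.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue  # invalid code point here: eliminate this page
        survivors.append(name)
    return survivors
```

As an example, byte 0x81 is unassigned in Python's cp1252 codec but assigned in cp037 and latin_1, so surviving_codepages(b"\x81", ["cp037", "cp1252", "latin_1"]) drops only cp1252. Pages such as cp037 and cp500 assign all 256 byte values in Python's tables, so they always survive this test, which is why it is a filter rather than a detector.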
