Re: [PATCH] Internal charset

Peter Christensen Thu, 11 May 2006 04:06:55 -0700

Hi Alex,

The euro sign is 0x80 in windows-1252. It is iso-8859-15 that has the euro in the same place as the currency sign (0xA4).


gsm_to_latin1:

static const struct {
    int gsmesc;
    int latin1;
} gsm_escapes[] = {
    {  10, 12 }, /* ASCII page break */
    {  20, '^' },
    {  40, '{' },
    {  41, '}' },
    {  47, '\\' },
    {  60, '[' },
    {  61, '~' },
    {  62, ']' },
    {  64, '|' },
    { 101, 128 },
    { -1, -1 }
};

{ 101, 128 } is the €: 0x1B 0x65 in GSM, 0x80 in windows-1252.


latin1_to_gsm:


    'x', 'y', 'z', -40, -64, -41, -61, NRP,       /* 120 - 127 */
    -101, NRP, NRP, NRP, NRP, NRP, NRP, NRP,       /* 128 - 135 */

Once again, 0x80 is encoded into 0x1B 0x65
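
To make the round trip concrete, here is a minimal sketch (not the actual gwlib code). It reuses the escape table from above and assumes, as my reading of the latin1_to_gsm excerpt, that a negative table entry -n means "ESC followed by n". The helper names gsm_escape_to_cp1252 and emit_gsm are made up for illustration.

```c
#include <stddef.h>

/* Escape table as quoted above: the GSM byte that follows ESC (0x1B)
 * and the windows-1252 byte it decodes to. */
static const struct {
    int gsmesc;
    int cp1252;
} gsm_escapes[] = {
    {  10, 12 }, /* form feed */
    {  20, '^' },
    {  40, '{' },
    {  41, '}' },
    {  47, '\\' },
    {  60, '[' },
    {  61, '~' },
    {  62, ']' },
    {  64, '|' },
    { 101, 128 }, /* euro: 0x1B 0x65 <-> 0x80 */
    { -1, -1 }
};

/* Hypothetical helper: decode the byte that follows ESC.
 * Returns the windows-1252 byte, or -1 if the escape is unknown. */
static int gsm_escape_to_cp1252(int gsmesc)
{
    size_t i;

    for (i = 0; gsm_escapes[i].gsmesc >= 0; i++) {
        if (gsm_escapes[i].gsmesc == gsmesc)
            return gsm_escapes[i].cp1252;
    }
    return -1;
}

/* Hypothetical helper: expand one latin1_to_gsm table entry into GSM
 * bytes. Assumption: a negative entry -n stands for ESC followed by n.
 * Returns the number of bytes written (1 or 2). */
static int emit_gsm(int entry, unsigned char out[2])
{
    if (entry < 0) {
        out[0] = 0x1B;
        out[1] = (unsigned char)(-entry);
        return 2;
    }
    out[0] = (unsigned char)entry;
    return 1;
}
```

With the table entry -101 at position 128, emit_gsm yields 0x1B 0x65, and decoding 0x65 through the escape table gives 0x80 back.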


Med venlig hilsen / Best regards

Peter Christensen

Developer
------------------
Cool Systems ApS

Tel: +45 2888 1600
 @ : [EMAIL PROTECTED]
www: www.coolsystems.dk


Alexander Malysh wrote:

Hi Peter,

I may be blind, but I don't see where gsm_to_latin1 and latin1_to_gsm
process the euro sign.

gsm_to_latin1:
- euro sign is not in gsm_escapes
- euro sign will be converted to 'e' because esc will be deleted

latin1_to_gsm:
- euro sign has the same code as CURRENCY SIGN and will be mapped to it (see
the latin1_to_gsm array). Here is a snippet:
        /* 160 - 167 */
        ' ',
         64, /* Inverted ! */
        'c', /* approximation of cent marker */
          1, /* Pounds sterling */
         36, /* International currency symbol */
          3, /* Yen */
         64, /* approximate broken bar as inverted ! */
         95, /* Section marker */

Does all of this make sense to you, or have I overlooked anything?

Thanks,
Alex

Peter Christensen wrote:

Hi,

The GSM charset has € as an escaped character (0x1B 0x65), and
latin1_to_gsm() and gsm_to_latin1() assume the windows-1252 character set.
So while I do admit that the patch is focused on SMPP, I doubt that it
breaks any of the other protocols.

If I go through each SMSC module:

smsc_at.c: Never does any charset conversion, but uses latin1_to_gsm and
gsm_to_latin1. So actually, this one already assumes windows-1252.

smsc_cgw.c: Apparently already assumes windows-1252 (0x80 = €). Does no
generic charset conversion.

smsc_cimd.c: Uses iso-8859-1. This one will need patching.

smsc_cimd2.c: iso-8859-1. Needs patching.

smsc_emi.c: Uses latin1_to_gsm/gsm_to_latin1

smsc_emi_x25.c: Uses its own gsm_to_iso function. The code looks kinda
deprecated. No support for extended chars at all, apparently.

smsc_fake.c:

smsc_http.c: Seems to do no charset conversion

smsc_ois.c: Uses latin1_to_gsm in some places, but a simplified
gsm_to_iso88591 conversion elsewhere.

smsc_oisd.c: Uses latin1_to_gsm/gsm_to_latin1

smsc_sema.c: Uses a simplified gsm conversion like the one in smsc_ois.c

smsc_smasi.c: Not sure what charset this assumes. There are no apparent
charset conversions in place

smsc_smpp.c: Uses latin1_to_gsm/gsm_to_latin1 and charset conversion.
Currently originator string is windows-1252 and body is iso-8859-1.

smsc_soap.c: Uses iso-8859-1

smsc_wrapper.c: No apparent charset conversion


My point is that while some protocols currently assume iso-8859-1, many
use latin1_to_gsm/gsm_to_latin1, which are ALREADY windows-1252.
Messages received from these gateways are windows-1252 as we speak,
although the documentation says otherwise. But as long as smsbox uses
iso-8859-1 and not windows-1252, no gateway can transmit the € character
without manual escaping, which I think is lame. If the charset in smsbox
were changed, at least some would have the possibility.

All this being said, I do agree that using UTF-8 internally is the best
way to go (but I assume that it will take a while before this is done).


Med venlig hilsen / Best regards

Peter Christensen

Developer
------------------
Cool Systems ApS

Tel: +45 2888 1600
  @ : [EMAIL PROTECTED]
www: www.coolsystems.dk


Alexander Malysh wrote:

Hi,

I don't see how your patch should help with the euro sign if the SMSC supports
only the GSM charset. Also, your patch is incomplete because it changes only
the SMPP module.

What would be more suitable for supporting all GSM chars is to switch the
internal kannel charset to UTF-8. I have a patch somewhere, but it will take
some time to rebase it against current CVS, and it's too intrusive (not
1.4.1 material).

For now it would be easy to keep latin1 as default but allow ESC (27) to
go through (in gwlib/charset.c change it from NRP to 27) and then you
should be able to send euro sign via sendsms interface.
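
As a rough illustration of what the sender would then have to do by hand, here is a sketch that rewrites a message so that each euro becomes ESC 'e' before it is submitted. It assumes the input marks the euro with byte 0x80; escape_euro is a hypothetical client-side helper, not something that exists in gwlib.

```c
#include <stddef.h>

/* Hypothetical helper: copy src to dst, replacing every 0x80 (euro in
 * windows-1252) with the two bytes ESC (0x1B) and 'e' (0x65), which is
 * the GSM escape sequence for the euro sign.
 * Returns the number of bytes written, or (size_t)-1 if dst is too small. */
static size_t escape_euro(const unsigned char *src, size_t len,
                          unsigned char *dst, size_t cap)
{
    size_t i, out = 0;

    for (i = 0; i < len; i++) {
        if (src[i] == 0x80) {
            if (cap - out < 2)
                return (size_t)-1;
            dst[out++] = 0x1B;
            dst[out++] = 0x65;
        } else {
            if (cap - out < 1)
                return (size_t)-1;
            dst[out++] = src[i];
        }
    }
    return out;
}
```

This only works if ESC really is passed through unchanged, which is the charset.c change suggested above.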

Thanks,
Alex

Peter Christensen wrote:

Hi,

At the request of Hillel, I have agreed to update my patch for the
internal character set of smsbox/smpp, and post it here, hoping for it
to be committed to CVS.

It:

* Changes the default 7-bit character set of smsbox to windows-1252
instead of iso-8859-1, adding support for the euro sign. (Remember that
the latin1/gsm conversion functions already assume windows-1252.)

* smsbox uses charset_convert instead of octstr_recode, because the
latter will convert the euro sign into an HTML entity.

* Changes the internal 7-bit character set of SMPP to windows-1252.

* Updates the documentation accordingly.


The primary effect of this patch should be support for the € sign in
both SMS transmission and reception (at least for gateways that
use the latin1/gsm conversion functions). For the rest, this should
have no effect, since windows-1252 is identical to iso-8859-1 except for
0x80-0x9F, which holds no printable characters in iso-8859-1.

Just to clarify: Unless the problem is in octstr_recode, this patch ONLY
adds support for the € (euro) sign. Other characters such as £ (pound)
also worked before. If a gateway didn't support £ before, it won't do it
now either. Besides, this patch does NOT add support for Greek GSM
characters!
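
To check whether a given message is affected at all, a sketch like the following could be used. It simply looks for bytes in 0x80-0x9F, the only range where the two charsets differ; the function name is made up for illustration.

```c
#include <stddef.h>

/* Returns 1 if any byte in buf falls in 0x80-0x9F, the only range where
 * windows-1252 and iso-8859-1 disagree; returns 0 otherwise. */
static int differs_between_cp1252_and_latin1(const unsigned char *buf,
                                             size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 0x80 && buf[i] <= 0x9F)
            return 1;
    }
    return 0;
}
```

A message containing only £ (0xA3) is reported as unaffected, while one containing € (0x80) is reported as affected.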
