Re: [PATCH] Internal charset

Alexander Malysh Thu, 11 May 2006 04:16:13 -0700

Hi Peter,

you are right! I must be blind today :( OK, I will look once again through
your patch and commit it.


Peter Christensen wrote:

> Hi Alex,
> 
> The euro sign is 0x80 in windows-1252. It is iso-8859-15 that has the euro
> at the same place as the currency sign.
> 
> gsm_to_latin1:
> 
> static const struct {
>      int gsmesc;
>      int latin1;
> } gsm_escapes[] = {
>      {  10, 12 }, /* ASCII page break */
>      {  20, '^' },
>      {  40, '{' },
>      {  41, '}' },
>      {  47, '\\' },
>      {  60, '[' },
>      {  61, '~' },
>      {  62, ']' },
>      {  64, '|' },
>      { 101, 128 },
>      { -1, -1 }
> };
> 
> { 101, 128 } is the € entry: 0x1B 0x65 in GSM, 0x80 in windows-1252.
> 
> 
> latin1_to_gsm:
> 
> 
>      'x', 'y', 'z', -40, -64, -41, -61, NRP,       /* 120 - 127 */
>      -101, NRP, NRP, NRP, NRP, NRP, NRP, NRP,       /* 128 - 135 */
> 
> Once again, 0x80 is encoded into 0x1B 0x65
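The round trip described above can be sketched as follows. This is only an illustration, not the code in gwlib/charset.c: the table is copied from the message, while GSM_ESC, decode_escapes() and encode_escape() are hypothetical names, and the sketch handles only the escape table (the real functions also map the GSM default alphabet).

```c
#include <assert.h>
#include <stddef.h>

#define GSM_ESC 27  /* 0x1B, introduces a GSM extension character (name is an assumption) */

/* Escape table as quoted in the message: GSM code after ESC -> windows-1252 byte. */
static const struct {
    int gsmesc;
    int latin1;
} gsm_escapes[] = {
    {  10, 12 }, /* ASCII page break */
    {  20, '^' },
    {  40, '{' },
    {  41, '}' },
    {  47, '\\' },
    {  60, '[' },
    {  61, '~' },
    {  62, ']' },
    {  64, '|' },
    { 101, 128 }, /* euro: ESC 'e' <-> 0x80 */
    { -1, -1 }
};

/* Hypothetical helper: resolve ESC pairs in a GSM byte string.
 * Bytes outside an escape pair are copied unchanged.
 * out must hold at least len bytes; returns the output length. */
static size_t decode_escapes(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (i < len) {
        if (in[i] == GSM_ESC && i + 1 < len) {
            int k, found = 0;
            for (k = 0; gsm_escapes[k].gsmesc >= 0; k++) {
                if (gsm_escapes[k].gsmesc == in[i + 1]) {
                    out[n++] = (unsigned char) gsm_escapes[k].latin1;
                    found = 1;
                    break;
                }
            }
            if (!found)
                out[n++] = in[i + 1]; /* unknown escape: ESC dropped, bare char kept */
            i += 2;
        } else {
            out[n++] = in[i++];       /* ordinary byte, or a trailing lone ESC */
        }
    }
    return n;
}

/* Hypothetical helper: encode one windows-1252 byte.
 * Writes ESC + code (2 bytes) for escape-table characters,
 * otherwise copies the byte (1 byte). out must hold 2 bytes. */
static size_t encode_escape(unsigned char c, unsigned char *out)
{
    int k;

    assert(out != NULL);
    for (k = 0; gsm_escapes[k].gsmesc >= 0; k++) {
        if (gsm_escapes[k].latin1 == c) {
            out[0] = GSM_ESC;
            out[1] = (unsigned char) gsm_escapes[k].gsmesc;
            return 2;
        }
    }
    out[0] = c;
    return 1;
}
```

With this table, encode_escape(0x80, buf) writes the two bytes 0x1B 0x65, and decode_escapes() turns that pair back into the single byte 0x80.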
> 
> 
> Med venlig hilsen / Best regards
> 
> Peter Christensen
> 
> Developer
> ------------------
> Cool Systems ApS
> 
> Tel: +45 2888 1600
>   @ : [EMAIL PROTECTED]
> www: www.coolsystems.dk
> 
> 
> Alexander Malysh wrote:
>> Hi Peter,
>> 
>> I may be blind, but I don't see where gsm_to_latin1 and latin1_to_gsm
>> process the euro sign.
>> 
>> gsm_to_latin1:
>> - euro sign is not in gsm_escapes
>> - euro sign will be converted to 'e' because esc will be deleted
>> 
>> latin1_to_gsm:
>> - euro sign has the same code as CURRENCY SIGN and will be mapped to it
>> (see latin1_to_gsm array). Here is a snippet:
>>         /* 160 - 167 */
>>         ' ',
>>          64, /* Inverted ! */
>>         'c', /* approximation of cent marker */
>>           1, /* Pounds sterling */
>>          36, /* International currency symbol */
>>           3, /* Yen */
>>          64, /* approximate broken bar as inverted ! */
>>          95, /* Section marker */
>> 
>> Does this all make sense to you, or have I overlooked anything?
>> 
>> Thanks,
>> Alex
>> 
>> Peter Christensen wrote:
>> 
>>> Hi,
>>>
>>> The GSM charset has € as an escaped character (0x1B 0x65), and
>>> latin1_to_gsm() and gsm_to_latin1() assume the windows-1252 character set.
>>> So while I do admit that the patch is focused on SMPP, I doubt that it
>>> breaks any of the other protocols.
>>>
>>> If I go through each SMSC module:
>>>
>>> smsc_at.c: Never does any charset conversion, but uses latin1_to_gsm and
>>> gsm_to_latin1. So actually, this one already assumes windows-1252.
>>>
>>> smsc_cgw.c: Apparently already assumes windows-1252 (0x80 = €). Does no
>>> generic charset conversion.
>>>
>>> smsc_cimd.c: Uses iso-8859-1. This one will need patching.
>>>
>>> smsc_cimd2.c: iso-8859-1. Needs patching.
>>>
>>> smsc_emi.c: Uses latin1_to_gsm/gsm_to_latin1
>>>
>>> smsc_emi_x25.c: Uses its own gsm_to_iso function. The code looks kinda
>>> deprecated. No support for extended chars at all, apparently.
>>>
>>> smsc_fake.c:
>>>
>>> smsc_http.c: Seems to do no charset conversion
>>>
>>> smsc_ois.c: Uses latin1_to_gsm in some places, but a simplified
>>> gsm_to_iso88591 conversion elsewhere.
>>>
>>> smsc_oisd.c: Uses latin1_to_gsm/gsm_to_latin1
>>>
>>> smsc_sema.c: Uses a simplified gsm conversion like the one in smsc_ois.c
>>>
>>> smsc_smasi.c: Not sure what charset this assumes. There are no apparent
>>> charset conversions in place
>>>
>>> smsc_smpp.c: Uses latin1_to_gsm/gsm_to_latin1 and charset conversion.
>>> Currently originator string is windows-1252 and body is iso-8859-1.
>>>
>>> smsc_soap.c: Uses iso-8859-1
>>>
>>> smsc_wrapper.c: No apparent charset conversion
>>>
>>>
>>> My point is that while some protocols currently assume iso-8859-1, many
>>> use latin1_to_gsm/gsm_to_latin1, which is ALREADY windows-1252.
>>> Messages received from these gateways are windows-1252 as we speak,
>>> although the documentation says otherwise. But as long as smsbox uses
>>> iso-8859-1 and not windows-1252, no gateway can transmit the € character
>>> without manual escaping, which I think is lame. If the charset in smsbox
>>> were changed, at least some would have the possibility.
>>>
>>> All this being said, I do agree that using UTF-8 internally is the best
>>> way to go (but I assume that it will take a while before this is done).
>>>
>>>
>>> Med venlig hilsen / Best regards
>>>
>>> Peter Christensen
>>>
>>> Developer
>>> ------------------
>>> Cool Systems ApS
>>>
>>> Tel: +45 2888 1600
>>>   @ : [EMAIL PROTECTED]
>>> www: www.coolsystems.dk
>>>
>>>
>>> Alexander Malysh wrote:
>>>> Hi,
>>>>
>>>> I don't see how your patch helps with the euro sign if the SMSC supports
>>>> only the GSM charset. Your patch is also incomplete because it changes
>>>> only the SMPP module.
>>>>
>>>> What would be more suitable for supporting all GSM chars is to switch
>>>> the internal kannel charset to UTF-8. I have a patch somewhere, but it
>>>> will take some time to rebase it against current CVS and it's too
>>>> intrusive (not 1.4.1 material).
>>>>
>>>> For now it would be easy to keep latin1 as default but allow ESC (27)
>>>> to go through (in gwlib/charset.c change it from NRP to 27) and then
>>>> you should be able to send euro sign via sendsms interface.
>>>>
>>>> Thanks,
>>>> Alex
>>>>
>>>> Peter Christensen wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> At the request of Hillel, I have agreed to update my patch for the
>>>>> internal character set of smsbox/smpp, and post it here, hoping for it
>>>>> to be committed to CVS.
>>>>>
>>>>> It:
>>>>>
>>>>> * Changes the default 7-bit character set of smsbox to windows-1252
>>>>> instead of iso-8859-1, adding support for the euro sign. (Remember
>>>>> that the latin1/gsm conversion functions already assume windows-1252.)
>>>>>
>>>>> * smsbox uses charset_convert instead of octstr_recode, because the
>>>>> latter will convert the euro sign into an HTML entity.
>>>>>
>>>>> * Changes the internal 7-bit character set of SMPP to windows-1252.
>>>>>
>>>>> * Updates the documentation accordingly.
>>>>>
>>>>>
>>>>> The primary effect of this patch should be support for the € sign in
>>>>> both SMS transmission and reception (at least for gateways that use
>>>>> the latin1/gsm conversion functions). For the rest, this
>>>>> should have no effect since windows-1252 is identical to iso-8859-1
>>>>> except for 0x80-0x9F, which is unused in iso-8859-1.
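To make the "identical except for 0x80-0x9F" point concrete, here is a minimal sketch that maps a single byte to its Unicode code point under each charset. It is not Kannel code: cp1252_to_unicode(), latin1_to_unicode() and check_shared_range() are hypothetical names, and iso-8859-1 is treated as mapping every byte to the code point of the same value.

```c
#include <assert.h>

/* windows-1252 assignments for bytes 0x80-0x9F as Unicode code points.
 * 0 marks the five bytes that windows-1252 leaves undefined. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: iso-8859-1 byte -> Unicode code point. */
static unsigned int latin1_to_unicode(unsigned char c)
{
    return c;
}

/* Hypothetical helper: windows-1252 byte -> Unicode code point
 * (0 for an undefined byte). Only 0x80-0x9F differs from iso-8859-1. */
static unsigned int cp1252_to_unicode(unsigned char c)
{
    if (c >= 0x80 && c <= 0x9F)
        return cp1252_high[c - 0x80];
    return latin1_to_unicode(c);
}

/* Hypothetical self-check: the two charsets agree on every byte
 * outside 0x80-0x9F. */
void check_shared_range(void)
{
    unsigned int b;

    for (b = 0; b < 256; b++) {
        if (b < 0x80 || b > 0x9F)
            assert(cp1252_to_unicode((unsigned char) b) ==
                   latin1_to_unicode((unsigned char) b));
    }
}
```

In this sketch, byte 0x80 comes out as U+20AC (€) under windows-1252, while £ at 0xA3 is the same under both charsets, which matches the claim that only the euro sign is affected.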
>>>>>
>>>>> Just to clarify: Unless the problem is in octstr_recode, this patch
>>>>> ONLY adds support for the € (euro) sign. Other characters such as £
>>>>> (pound) also worked before. If a gateway didn't support £ before, it
>>>>> won't do it now either. Besides, this patch does NOT add support for
>>>>> Greek GSM characters!
>>>>>
>>

-- 
Thanks,
Alex
