Re: [ast-developers] [patch] Updated $'\w[hex]' patch for GB18030&&co. ... / was: Re: [patch] Accessing widechar codepoints without unicode (GB18030-related) ...

Lionel Cons Mon, 02 Sep 2013 11:17:10 -0700

On 2 September 2013 17:11, Wendy Lin <[email protected]> wrote:
> On 2 September 2013 05:50, Roland Mainz <[email protected]> wrote:
>> On Mon, Sep 2, 2013 at 1:42 AM, Roland Mainz <[email protected]> 
>> wrote:
>>> On Mon, Sep 2, 2013 at 1:09 AM, Roland Mainz <[email protected]> 
>>> wrote:
>>>> On Mon, Sep 2, 2013 at 1:06 AM, Roland Mainz <[email protected]> 
>>>> wrote:
>>>>> On Mon, Sep 2, 2013 at 12:36 AM, Roland Mainz <[email protected]> 
>>>>> wrote:
>>>>>> On Mon, Aug 5, 2013 at 5:01 AM, Roland Mainz <[email protected]> 
>>>>>> wrote:
>>>>>>> On Mon, Aug 5, 2013 at 4:13 AM, Roland Mainz <[email protected]> 
>>>>>>> wrote:
>>>> [snip]
>>>>> ** More notes:
>>>>> 1. $ ksh -c 'export LC_ALL=en_US.ISO8859-15 ; printf "x\u[20ac]x\n" |
>>>>> iconv -f ISO8859-15 -t UTF-8' # now works... the correct output is
>>>>> "x€x"
>>>>> 2. The reason why this didn't work in the *002* patch was that the
>>>>> original code in ast-ksh.2013-08-29 used |wc2utf8()| on an "extended
>>>>> single-byte locale" like "en_US.ISO8859-15" ... this can **never**
>>>>> work because the locale is not UTF-8 based
>>>>>
>>>>> Glenn/David: What do you think about the patch ?
>>>>
>>>> I forgot one note:
>>>> - The patch _explicitly_ uses |iconv()| even for UTF-8 locales to
>>>> weed out unassigned codepoints to fulfill the Unicode requirement that
>>>> no unassigned codepoints should be accessible.
>>>
>>> Last updated patch for tonight:
>>>
>>> Attached (as "astksh20130829_printf_w_gb18030_004.diff.txt") is an
>>> updated version of the patch which now automagically uses "\u[hex]" as
>>> output instead of "\w[hex]" for UTF-8 locales, making the output 100%
>>> compatible with previous ksh93 versions except for the described bugs in
>>> those versions.
>>>
>>> BTW: Some example usage for $ set -o convunicode # (byte "a4" is the
>>> Euro character in ISO8859-15):
>>> -- snip --
>>> $ ksh -c 'export LC_ALL=en_US.ISO8859-15 ; printf "euro=|%q|\n"
>>> "$(printf "\xa4")" | iconv -f ISO8859-15 -t UTF-8'
>>> euro=|$'€'|
>>> $ ksh -o convunicode -c 'export LC_ALL=en_US.ISO8859-15 ; printf
>>> "euro=|%q|\n" "$(printf "\xa4")" | iconv -f ISO8859-15 -t UTF-8'
>>> euro=|$'\u[20ac]'|
>>> -- snip --
>>>
>>> Comments/rants/etc. welcome...
>>>
>>> ... and David/Glenn: Please don't remove the comments in the code if
>>> you take the patch... there's a reason why I'm quite verbose in the
>>> comments (short: Hideously complex and lots of traps in the code) ...
>>
>> Attached (as "astksh20130829_printf_w_gb18030_005.diff.txt") is a
>> fixed patch... the previous one missed a |continue;| statement which
>> caused failures in the "locale.sh" test module (found by Wang Shouhua)
>> ...
>
> Roland, thank you very much for the patch. I've been testing it the
> last couple of hours in both Japanese and Chinese environments and
> have to say: I am impressed. I can now address individual characters
> just by their hexadecimal Unicode value, and it works in any locale.
> This improves portability a lot and brings ksh93 to parity with Perl.


I think it's even more impressive that \u[] can now be used in
en_GB.iso885915 (which is a single-byte locale) to pick characters,
provided the locale's charset contains them.

The *shame* is that single-byte locales like en_GB.iso885915 were
broken for such a long time. Kudos to Roland Mainz for fixing the
problems.
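
For anyone wondering why the old |wc2utf8()| path could never work in
a single-byte locale, here is a rough sketch using plain iconv(1)
rather than ksh itself. It assumes an iconv that knows the charset
name "ISO-8859-15"; the helper names euro_utf8 and to_hex_in_charset
are made up for this illustration and are not part of the patch.

```shell
# Hypothetical helper: emit U+20AC (Euro sign) encoded as UTF-8.
# Octal escapes are used because they are portable across printf(1)
# implementations: E2 82 AC == \342 \202 \254.
euro_utf8() {
    printf '\342\202\254'
}

# Hypothetical helper: convert UTF-8 on stdin to the charset named in $1
# and print the resulting bytes as one hex string.
to_hex_in_charset() {
    iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
}

# The UTF-8 encoding is three bytes long ...
euro_utf8 | od -An -tx1 | tr -d ' \n'; echo     # e282ac

# ... but the ISO8859-15 encoding is the single byte a4, so the
# codepoint has to go through a charset conversion, not a UTF-8 encoder.
euro_utf8 | to_hex_in_charset ISO-8859-15; echo # a4
```

Writing the three UTF-8 bytes into an ISO8859-15 stream would produce
three unrelated characters instead of one Euro sign.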

Lionel
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers
