[power-pro] Re: Unicode bugs?

Sheri Fri, 14 Aug 2009 06:55:53 -0700

--- In [email protected], "entropyreduction" 
<alancampbelllists+ya...@...> wrote:
>
> --- In [email protected], "Sheri" <sherip99@> wrote:
> >
> > How do you do case modifications of unicode?
> > 
> > local subj3u=unicode.new(?"Ärger, Ångstrøm, Über")
> > unicode.messagebox("ok", subj3u.change_case("lower case"), "subju in lower 
> > case") ;;doesn't work
> 
> unicode.messagebox("ok", subj3u.change_case("lower")
> does
> 
> I try to track exactly what equivalent pp function does, and that seems to 
> require exactly "lower", not a string beginning "lower".
> 
> However, I'm not implementing
> 
> case ("tonum", "abc") translates the characters in the string to
> space-separated decimal equivalents. 
> case ("tonumx", "abc") translates the characters in the string to
> space-separated hexadecimal equivalents. 
> case ("fromnum", "97 98 99") translates space-separated decimal
> numbers to corresponding ASCII characters
> case ("fromnumx", "61 62 63") translates space-separated
> hexadecimal numbers to corresponding ASCII characters
> 
> Any of those of use?
>


I don't know. Probably could be. I think I found the "lower case" spelling in 
one of the demo unicode scripts. BTW, those scripts don't currently run 
without errors.

Actually, what I observed is that, not surprisingly, the case modifiers for 
backreferences in the regex plugin format string do not work for utf8 strings 
involving non-ascii (above 127) characters. If you remember, we have, e.g., $u0 
to make $0 upper case. I will document that the case modifiers should be 
avoided for utf8 (although as long as the only characters being modified are 
among the lower 127, I think it is ok).

I suppose it would be possible (if you want) to implement a second set of 
signals in the format string such as x, y, z instead of l, u, t (lower, upper, 
title). I'm just trying to avoid impacting the performance of the case mods for 
non-utf8 stuff. So the user would include, e.g. $x0 for lower case $0 in utf8. 
Behind the scene you'd need to convert the backreference from utf8 to unicode, 
modify the case, and convert back to utf8. Hopefully nothing would be added or 
lost in translation.

Or, user could do something similar with pcrereplacecallback to implement 
his/her own case-modded backreferences in a replacement. In a pcrematchall, 
user could output a vector, and do unicode case modifications on the utf8 
elements in the vector.
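
Both user-side approaches can be sketched roughly as follows. This uses 
Python's re module as a stand-in for pcrereplacecallback and pcrematchall (an 
assumption made only for illustration; the pattern and the callback name 
upper_match are hypothetical):

```python
import re

text = "Ärger, Ångstrøm, Über"


def upper_match(match):
    """Callback: return the whole match converted to upper case."""
    return match.group(0).upper()


# Replacement with a callback: upper-case only the words ending in "er".
replaced = re.sub(r"\w*er\b", upper_match, text)

# Match-all variant: collect the matches in a list, then change the case of
# each element afterwards.
words = re.findall(r"\w+", text)
lowered = [word.lower() for word in words]

print(replaced)
print(lowered)
```

The point of the callback is that the case change happens on decoded text 
inside user code, so the format string's own case modifiers are never involved.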

What do you think?

Of course in the example I gave: "Ärger, Ångstrøm, Über", user can process that 
string in regex pcreservices without the utf8 flag. If Windows is using the 
western code page for non-unicode, the ascii/ansi case mods we already have 
would work fine on that string.
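
To illustrate why that works, here is a small Python sketch, assuming the 
western code page is Windows-1252 ("cp1252" in Python). In a single-byte code 
page every character is exactly one byte, so a 256-entry lookup table is enough 
to change case. The helper build_upper_table is hypothetical and is not how 
Windows or the plugin builds its tables:

```python
CODEPAGE = "cp1252"  # assumed western code page


def build_upper_table(codepage: str) -> bytes:
    """Build a 256-byte translation table mapping each byte to its upper case."""
    table = bytearray(range(256))
    for value in range(256):
        try:
            char = bytes([value]).decode(codepage)
            encoded = char.upper().encode(codepage)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue  # byte undefined, or upper case not in this code page
        if len(encoded) == 1:
            table[value] = encoded[0]
    return bytes(table)


subject = "Ärger, Ångstrøm, Über"
ansi = subject.encode(CODEPAGE)
utf8 = subject.encode("utf-8")

# One byte per character in the code page, but not in utf8.
print(len(subject), len(ansi), len(utf8))

upper = ansi.translate(build_upper_table(CODEPAGE))
print(upper.decode(CODEPAGE))
```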

Regards,
Sheri
