--- In [email protected], "entropyreduction" <alancampbelllists+ya...@...> wrote:
>
> --- In [email protected], "Sheri" <sherip99@> wrote:
> >
> > How do you do case modifications of unicode?
> >
> > local subj3u=unicode.new(?"Ärger, Ångstrøm, Über")
> > unicode.messagebox("ok", subj3u.change_case("lower case"), "subju in lower case") ;;doesn't work
> > unicode.messagebox("ok", subj3u.change_case("lower")
>
> does
>
> I try to track exactly what equivalent pp function does, and that seems to
> require exactly "lower", not a string beginning "lower".
>
> However, I'm not implementing
>
> case ("tonum", "abc") translates the characters in the string to
> space-separated decimal equivalents.
> case ("tonumx", "abc") translates the characters in the string to
> space-separated hexadecimal equivalents.
> case ("fromnum", "97 98 99") translates space-separate decimal
> numbers to corresponding ASCII character
> case ("fromnumx", "61 62 63") translates space-separate
> hexadecimal numbers to corresponding ASCII character
>
> Any of those of use?
>
I don't know. They probably could be. I think I found the "lower case" spelling in one of the demo unicode scripts. BTW, those scripts don't currently run without errors.

What I actually observed is that, not surprisingly, the case modifiers for backreferences in the regex plugin format string do not work for utf8 strings involving non-ascii (above 127) characters. If you remember, we have e.g. $u0 to make $0 upper case. I will document that the case modifiers should be avoided for utf8 (although I think it is ok as long as the only characters being case-modded are ascii, i.e. 127 and below).

I suppose it would be possible (if you want) to implement a second set of signals in the format string, such as x, y, z instead of l, u, t (lower, upper, title). I'm just trying to avoid impacting the performance of the case mods for non-utf8 stuff. So the user would include, e.g., $x0 for lower case $0 in utf8. Behind the scenes you'd need to convert the backreference from utf8 to unicode, modify the case, and convert back to utf8. Hopefully nothing would be added or lost in translation.

Or, the user could do something similar with pcrereplacecallback to implement his/her own case-modded backreferences in a replacement. In a pcrematchall, the user could output a vector and do unicode case modifications on the utf8 elements in the vector. What do you think?

Of course, in the example I gave, "Ärger, Ångstrøm, Über", the user can process that string in regex pcreservices without the utf8 flag. If Windows is using the western code page for non-unicode, the ascii/ansi case mods we already have would work fine on that string.

Regards,
Sheri
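
For illustration only, here is a rough sketch of the utf8 -> unicode -> case change -> utf8 round trip described above, using a replacement callback. It is written in Python, not PowerPro script, and it is not how the regex plugin works internally. The names change_case_utf8 and replace_with_case are hypothetical, and the x, y, z letters are just the ones proposed above.

```python
import re

# Hypothetical mapping of the proposed format-string letters to case operations.
CASE_OPS = {
    "x": str.lower,  # proposed utf8 counterpart of $l
    "y": str.upper,  # proposed utf8 counterpart of $u
    "z": str.title,  # proposed utf8 counterpart of $t
}


def change_case_utf8(raw: bytes, op: str) -> bytes:
    """Decode utf8 bytes to text, change the case, and encode back to utf8."""
    return CASE_OPS[op](raw.decode("utf-8")).encode("utf-8")


def replace_with_case(pattern: bytes, subject: bytes, op: str) -> bytes:
    """Replace each match with its case-modified form via a callback."""
    return re.sub(pattern, lambda m: change_case_utf8(m.group(0), op), subject)


subject = "Ärger, Ångstrøm, Über".encode("utf-8")

# Match runs of bytes that are neither ASCII whitespace nor commas.
lowered = replace_with_case(rb"[^\s,]+", subject, "x")
print(lowered.decode("utf-8"))  # ärger, ångstrøm, über
```

On "added or lost in translation": with full Unicode case mapping the result can differ in length from the input. In Python, for instance, upper-casing "ß" gives "SS", so the byte length of a backreference should not be assumed to stay the same after a case change.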
