On Tue, Mar 2, 2010 at 4:39 PM, Stephan Hennig <[email protected]> wrote:
> On 02.03.2010 14:41, luigi scarso wrote:
>>
>> On Tue, Mar 2, 2010 at 2:01 PM, Stephan Hennig<[email protected]>
>> wrote:
>>>
>>> The output of
>>>
>>> str = "abcde"
>>> print(unicode.utf8.match(str, "()e"))
>>> str = "Äabcde"
>>> print(unicode.utf8.match(str, "()e"))
>>>
>>> is 5 and 7. The second one is obviously wrong.
>>
>> I believe 7 is ok, because in utf8 Äabcde is 7 octet long
>> and unittest.c says
>> NOTE: find positions are in bytes for all ctypes!
>
> Logicians might be satisfied with broken behaviour as long as it's
> documented.

I believe that it's not broken behaviour; it's only a mix of two
different points of view: the "abstract" one (or "sign" or "glyph" or
"character"), where we see Ä as one unit, and the "implementation" one,
where Ä in utf8 is two octets.

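To make the two points of view concrete, here is a minimal sketch of how a byte position like the 7 above could be mapped back to a character position. It is written in Python only so that it can be run anywhere; it is not Lua and not part of the unicode library, and the helper name byte_to_char_pos is made up for this illustration. It assumes the string is valid UTF-8 and that Ä is the precomposed U+00C4.

```python
def byte_to_char_pos(text, byte_pos):
    """Map a 1-based byte position in the UTF-8 encoding of `text`
    to the 1-based position of the character containing that byte.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-8")
    # Continuation bytes have the bit pattern 10xxxxxx. Every other byte
    # starts a new character, so counting those gives the character index.
    return sum(1 for b in data[:byte_pos] if (b & 0xC0) != 0x80)


plain = "abcde"
accented = "\u00c4abcde"  # "Äabcde" with precomposed Ä (two octets in UTF-8)

# 1-based byte position of "e", i.e. the "implementation" view.
plain_byte_pos = plain.encode("utf-8").index(b"e") + 1
accented_byte_pos = accented.encode("utf-8").index(b"e") + 1

# The same positions in the "abstract" (character) view.
plain_char_pos = byte_to_char_pos(plain, plain_byte_pos)
accented_char_pos = byte_to_char_pos(accented, accented_byte_pos)

print(plain_byte_pos, plain_char_pos)
print(accented_byte_pos, accented_char_pos)
```

For pure ASCII input both views agree; they differ only once a multi-octet character appears before the match.
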
> But I'm not a logician, so I cannot agree. :)

To be honest, I'm not comfortable with regex and unicode.
Perl can help here; just to see an example:

#> perl -e '$str = "Äabcde"; print length($str),"\n" ;' ;
7
#> perl -e 'use utf8; $str = "Äabcde"; print length($str),"\n" ;' ;
6
#> perl -v
This is perl, v5.10.0 built for i586-linux-thread-multi

Of course there are other libs, like
http://site.icu-project.org/
http://www.pcre.org/pcre.txt
and of course luatex can become bigger and slower.
A solution can be dynamic loading, so one can choose at runtime which
module to use, but we must ensure that the same shared lib is available
on all systems, and this is not easy.

-- 
luigi
