On 02.03.2010 17:18, luigi scarso wrote:
On Tue, Mar 2, 2010 at 4:39 PM, Stephan Hennig wrote:
On 02.03.2010 14:41, luigi scarso wrote:
I believe 7 is ok, because in utf8 Äabcde is 7 octets long
and unittest.c says
NOTE: find positions are in bytes for all ctypes!
Logicians might be satisfied with broken behaviour as long as it's
documented.
I believe that it's not broken behaviour, it's only a mix of two
different points of view:
"abstract" (or "sign" or "glyph" or "character"), where we see Ä as a "unit",
and "implementation", where Ä in utf8 is two octets.
Yes, that's why I call it "broken". Switching point of view within the
unicode.utf8 functions doesn't seem a good design to me. I cannot see
why it could be sensible to regard the length of Ä as one (character) in
len and two (octets) in find. After all, we already have functions
that return byte positions in a string: string.find and
unicode.ascii.find. Why not drop unicode.utf8.find altogether? That'd be a
clear design. (Only beaten by a find function that regards Ä the same
length as len does. There are use-cases for such a find function.)
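
To make the two points of view concrete, here is a minimal sketch in Python (used only for illustration; it is not the unicode.utf8 library discussed here, and the helper name byte_to_char_pos is hypothetical). It assumes 1-based positions, as in Lua, and shows how a byte position could be mapped back to the character position that would be consistent with len:

```python
text = "\u00c4abcde"              # "Äabcde"
encoded = text.encode("utf-8")    # Ä becomes the two octets C3 84

char_len = len(text)              # length in characters
byte_len = len(encoded)           # length in octets

# 1-based position of "b", once counted in characters, once in octets
char_pos = text.find("b") + 1
byte_pos = encoded.find(b"b") + 1


def byte_to_char_pos(data, pos):
    """Map a 1-based byte position in UTF-8 data to a 1-based character position.

    Assumes pos points at the first octet of a character.
    """
    return len(data[:pos - 1].decode("utf-8")) + 1


print(char_len, byte_len)         # 6 7
print(char_pos, byte_pos)         # 3 4
print(byte_to_char_pos(encoded, byte_pos))  # 3
```

A find that agreed with len would report 3 here, while a byte-oriented find reports 4.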
But I'm not a logician, so I cannot agree. :)
To be honest, I'm not comfortable with regex and unicode.
Perl can help here; just to see an example:
#> perl -e '$str = "Äabcde"; print length($str),"\n" ;' ;
7
#> perl -e 'use utf8; $str = "Äabcde"; print length($str),"\n" ;' ;
6
Same with string.len and unicode.utf8.len in Lua. You made me curious.
Is there a find function in Perl? What values does that return?
Best regards,
Stephan Hennig