On 02.03.2010 17:18, luigi scarso wrote:
On Tue, Mar 2, 2010 at 4:39 PM, Stephan Hennig<[email protected]>  wrote:
On 02.03.2010 14:41, luigi scarso wrote:

I believe 7 is OK, because in UTF-8 Äabcde is 7 octets long,
and unittest.c says
  NOTE: find positions are in bytes for all ctypes!

Logicians might be satisfied with broken behaviour as long as it's
documented.
I believe that it's not broken behaviour; it's only a mix of two
different points of view:
the "abstract" one (or "sign", "glyph", or "character"), where we see Ä as a "unit",
and the "implementation" one, where Ä in UTF-8 is two octets.

Yes, that's why I call it "broken". Switching point of view within the unicode.utf8 functions doesn't seem like good design to me. I cannot see why it would be sensible to regard the length of Ä as one (character) in len and as two (octets) in find. After all, we already have functions that return byte positions in a string: string.find and unicode.ascii.find. Why not drop unicode.utf8.find altogether? That would be a clear design. (It would only be beaten by a find function that regards Ä as having the same length as len does. There are use cases for such a find function.)


But I'm not a logician, so I cannot agree. :)
To be honest, I'm not comfortable with regex and Unicode.

Perl can help here. Just to see an example:

#>  perl  -e '$str = "Äabcde"; print length($str),"\n" ;' ;
7
#>  perl  -e 'use utf8; $str = "Äabcde"; print length($str),"\n" ;' ;
6

Same with string.len and unicode.utf8.len in Lua. You made me curious. Is there a find function in Perl? What values does that return?

Best regards,
Stephan Hennig
