On 11 August 2013 07:24, Chris Angelico <ros...@gmail.com> wrote:
> On Sun, Aug 11, 2013 at 7:17 AM, Joshua Landau <jos...@landau.ws> wrote:
>> Given tweet = b"caf\x65\xCC\x81".decode():
>>
>> >>> tweet
>> 'café'
>>
>> But:
>>
>> >>> len(tweet)
>> 5
>
> You're now looking at the difference between glyphs and combining
> characters. Twitter counts combining characters, so when you build one
> "thing" out of lots of separately-typed parts, it does count as more
> characters.
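
For anyone following along, here is a rough sketch of where the 5 comes from and what normalising does to it. It only uses the standard-library unicodedata module; the variable names are mine, and treating NFC as the relevant normalisation is an assumption for illustration, not something established in this thread.

```python
import unicodedata

# The decomposed form from the example above: "e" followed by U+0301.
tweet = b"caf\x65\xCC\x81".decode("utf-8")

# One name per code point shows where the extra "character" comes from.
names = [unicodedata.name(ch) for ch in tweet]

# NFC folds "e" + U+0301 into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", tweet)

# Code point counts before and after normalisation.
print(len(tweet), names)
print(len(composed))

# UTF-8 byte counts for the two representations.
print(len(tweet.encode("utf-8")), len(composed.encode("utf-8")))
```

So len() is counting code points, and the same visible string can have two different lengths (and two different UTF-8 sizes) depending on whether the accent was sent precomposed or as a combining character.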

@https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
> The "café" issue mentioned above raises the question of how you count
> the characters in the Tweet string "café". To the human eye the length is
> clearly four characters. Depending on how the data is represented this
> could be either five or six UTF-8 bytes. Twitter does not want to penalize
> a user for the fact we use UTF-8 or for the fact that the API client in
> question used the longer representation. Therefore, Twitter does count
> "café" as four characters no matter which representation is sent.

Which would imply that Twitter doesn't count combining characters
separately, even though the web interface seems to.

> Read this article for some arguments on the subject, including a
> number of references to Twitter itself:
>
> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

I read that *last* time you pointed it out :P. It's a good link, though.

--

Anyhow, it's good to know I haven't been obviously stupid with my
understanding of Unicode. I learnt it all from this list anyway;
wouldn't want to disappoint!
--
http://mail.python.org/mailman/listinfo/python-list