On Sun, 25 Nov 2012 22:12:33 +1100, Chris Angelico wrote:

> On Sun, Nov 25, 2012 at 9:19 PM, kobayashi <pg.k...@gmail.com> wrote:
>> Hello,
>>
>> Under platform that has fixed pitch font, I want to get a "screen"
>> length of a multibyte string
>>
>> --- sample ---
>> s1 = u"abcdef"
>> s2 = u"あいう" # It has same "screen" length as s1's.
>> print len(s1) # Got 6
>> print len(s2) # Got 3, but I want get 6.
>> --------------
>>
>> Abobe can get a "character" length of a multibyte string. Is there a
>> way to get a "screen" length of a multibyte string?
>
> What do you mean by screen length? Do you mean the length in bytes? That
> depends on your encoding. Do you mean width of the displayed version?
> That depends on your font.

That's what I thought, but after some experimentation in my terminal
and some googling, I have come to the understanding that so-called
monospaced (fixed-width) fonts may support *double column* characters
as well as single column. So the OP's example has:

    s1 = u"abcdef"
    s2 = u"あいう"

s1 has six single-column ("narrow") characters, while s2 has three
double-column ("wide") characters, and both strings should take up the
same horizontal space on screen.

If you are reading this in a non-monospaced font, the width of each
character is not fixed, the idea of columns doesn't really work, and
the strings may not be the same width.

See http://www.unicode.org/reports/tr11/tr11-19.html for more detail.

Interestingly, Unicode supports wide versions of many non-East-Asian
characters (presumably because pre-Unicode East Asian encodings
supported them). For example, run this code in Python:

    print u'\N{FULLWIDTH LATIN CAPITAL LETTER A}'; print u'AA'

which should output:

    Ａ
    AA

If your font supports this, you should see a single "Ａ" as wide as the
double "AA" beneath it.

Curiously, in the monospaced font I am using to type this, the
"fullwidth" (wide, two-column) A is actually two-thirds the width of
the standard ("halfwidth", narrow, one-column) A. Font designers --
can't live with them, can't take them out and shoot them.

Hans Mulder's suggestion:

    from unicodedata import east_asian_width

    def screen_length(s):
        return sum(2 if east_asian_width(c) == 'W' else 1 for c in s)

is almost right. The Unicode document above states:

[quote]
In a broad sense, wide characters include W, F, and A (when in East
Asian context), and narrow characters include N, Na, H, and A (when
not in East Asian context).
[end quote]

    from unicodedata import east_asian_width

    def columns(s, eastasian_context=True):
        # 'A' (ambiguous) characters count as wide only in an
        # East Asian context.
        if eastasian_context:
            wide = 'WFA'
        else:
            wide = 'WF'
        return sum(2 if east_asian_width(c) in wide else 1 for c in s)

ought to do it for all but the most sophisticated text layout
applications.

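As a quick sanity check of the columns() function above, here is a
minimal sketch that applies it to the OP's two strings. The function is
repeated so the snippet runs on its own. The Greek alpha test character
is my own choice of an "ambiguous" (class A) character, not something
from the OP's post, and the print calls are written so that they work
under both Python 2 and Python 3.3+.

```python
# -*- coding: utf-8 -*-
from unicodedata import east_asian_width

def columns(s, eastasian_context=True):
    # Same logic as the function above, repeated so this runs standalone.
    if eastasian_context:
        wide = 'WFA'
    else:
        wide = 'WF'
    return sum(2 if east_asian_width(c) in wide else 1 for c in s)

s1 = u"abcdef"
s2 = u"\u3042\u3044\u3046"  # the OP's three hiragana characters
alpha = u"\u03b1"           # GREEK SMALL LETTER ALPHA, class 'A' (my example)

print(columns(s1))                              # six narrow characters
print(columns(s2))                              # three wide characters
print(columns(alpha, eastasian_context=True))   # ambiguous, treated as wide
print(columns(alpha, eastasian_context=False))  # ambiguous, treated as narrow
```

Both of the OP's strings should come out as 6 columns, and the alpha
should flip between 2 and 1 depending on the context flag. Note that
this still counts combining marks and control characters as one column
each, which is one of the things a more careful implementation would
need to handle.
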
For those needing much more sophistication, see here:

http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c


-- 
Steven