[issue24665] CJK support for textwrap

Terry J. Reedy Sun, 08 Jul 2018 15:52:26 -0700


Terry J. Reedy <tjre...@udel.edu> added the comment:


I think that this issue should be closed, as it is based on some confusions and 
errors.

Textwrap works in terms of characters.  The wrap method "wraps the single 
paragraph in text (a string) so every line is at most width characters long."  
When the module was written, 'character' meant "printable ascii (or 'extended 
ascii') character".  It now means 'unicode codepoint'.  Both are mentally real 
abstractions but have no particular correspondence to physical length.  Calling 
textwrap buggy because it works in characters is wrong.
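
A minimal sketch of what "works in terms of characters" means in practice. The sample string and the width of 8 are arbitrary choices made for illustration, not taken from the issue:

```python
import textwrap

# Ten CJK ideographs followed by ten ASCII letters, with no whitespace.
text = "\u4e00" * 10 + "a" * 10

lines = textwrap.wrap(text, width=8)

# textwrap counts code points: no line exceeds 8 of them, however wide
# each one happens to be drawn on a given display.
line_lengths = [len(line) for line in lines]
print(line_lengths)
```

The limit holds for the ideographs and the ASCII letters alike, which is the documented behavior.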

Translating 'character' to 'fixed-width character space', so that one can 
measure physical length in terms of 'spaces' as a physical unit, is exact if 
and only if all characters are displayed in the same width space.  This is true 
for fixed-pitch output devices that simulate typewriters and text that only 
uses fixed-width characters from a fixed-pitch font.  For long lines, the 
translation for variable-pitch fonts may or may not be good enough for a 
particular use.

As David noted, textwrap already fails for Ascii control characters, and 
they do cause problems when used in wrapped text.  They are coded with 2 or 
4 characters on input and may display as 0, 1, (possibly 2), 4, or 5 characters 
on output, depending on the display code and the display 'device'.  As to the 
latter, "print('x\ax')" displays as 'xx' in a Windows console and as something 
like x□x in a tkinter Text widget, except that the numbers in the box here on 
Firefox are not present, so that the tk box is (sort-of) the same width as 'x'.
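
To make the input-versus-character count concrete, here is a small sketch using the bell character; the variable names are only illustrative:

```python
import textwrap

# The bell character can be typed in source as a 2- or 4-character escape.
short_escape = r"\a"     # raw string: backslash and 'a'
long_escape = r"\x07"    # raw string: four characters

bell = "\a"              # after escape processing: a single code point
same_bell = "\x07"       # the same code point, spelled differently

text = "ab" + bell + "cd"   # five code points, whatever the display shows
lines = textwrap.wrap(text, width=5)

print(len(short_escape), len(long_escape), len(bell), len(text))
print(lines)
```

textwrap counts the bell as one character, so the five-code-point string fits in a width of 5 even though a console may show only four visible characters.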

The particular premise of this issue is that CJK characters are somehow special 
and that 2.x releases, and now 3.x releases, are particularly broken for CJK 
text.  Not so.  If one has text that only uses same-width characters in a 
fixed-pitch CJK font (including wide spaces so columns line up), then textwrap 
works as well as it does for any other fixed-pitch text (i.e., Ascii or Latin1).  
If one wants lines of a particular physical width, one passes a character width 
argument that corresponds to the desired physical width.

The following is based on what I see in IDLE's Settings dialog Font page font 
sample for Windows 10 'Source Code Pro'.  It includes samples from 12 
'alphabets'.  To view it, run 'python -m idlelib', and on the top menu click 
Options => Configure IDLE.  When the selected font is not a full BMP unicode 
font, Tk and Windows use other fonts, scaled to the same height, to synthesize 
a fairly complete BMP 'font'.  The [Help] text says a bit more but has a 
mistake.  What I see:

Font size corresponds to physical height.  Hence, the lines are very close to 
the same height.  Some fonts look smaller or larger because they specify more 
or less blank space between lines.  One factor is the use of descenders, as in 
Arabic.

Character width for a fixed height varies.  20 characters in Greek, IPA, 
Hebrew, and Arabic take progressively less physical space than 20 Ascii or Latin1 
characters.  20 characters in Devanagari, Cyrillic, and Tamil take 
progressively more. (The Tamil line only has 14 chars.)  None of these are 
obviously fixed pitch.

The Chinese, Korean, and Japanese samples have a fixed pitch.  The characters 
are *not* actually 'double-wide', at least not relative to most other languages.  
The 10 CJK characters are as wide as 16 Source Code Pro characters.  To match 
the physical width of 72 Ascii spaces, one should pass 'width=45'.
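
The arithmetic behind 'width=45', as a sketch.  The 16:10 ratio is the measurement reported above for Source Code Pro and would differ for other fonts; the sample text is made up:

```python
import textwrap

ASCII_COLUMNS = 72       # desired physical width, in Ascii character cells
CJK_SAMPLE_CHARS = 10    # 10 CJK characters ...
ASCII_EQUIVALENT = 16    # ... are as wide as 16 Source Code Pro characters

# 72 * 10 / 16 = 45 CJK characters fill the same physical width.
cjk_width = round(ASCII_COLUMNS * CJK_SAMPLE_CHARS / ASCII_EQUIVALENT)

text = "\u6587" * 100    # made-up pure-CJK text with no whitespace
lines = textwrap.wrap(text, width=cjk_width)

print(cjk_width)
print([len(line) for line in lines])
```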

But note that the exact ratio for Ascii depends on the font. It is a little 
higher for Courier and Lucida Console.  It ranges from about 1 (for Arabic) to 
2 (for Tamil) for other languages.  The first 10 Tamil characters are slightly 
wider than the 10 CJK characters, so counting each CJK character as two average 
Tamil characters is completely wrong.

My conclusion: the proposal is unnecessary for pure CJK text; it is wrong in 
hard-coding a fix only for CJK; the CJK fix is wrong in hard-coding a 
particular ratio, in particular, one that is at the extreme end of the range of 
possibilities.  Therefore, I think the open PR should be closed.  I also think 
this issue should be closed in favor of #12499, which proposed to allow users 
to pass a transform function suitable for their particular use case.  If that 
is implemented, and we decide to then add some sample functions, or rather, 
function factories, and to include one specifically for CJK*, then a new PR 
will be needed, and a new issue would be appropriate.

* A more generic function factory for text with characters of two width classes 
might have as inputs a condition to identify '2nd language characters' and 
their fixed or average width relative to the 'first' language.
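
One possible shape for such a factory, as a rough sketch only.  The names make_width_function and is_cjk_ideograph are hypothetical, textwrap has no parameter that accepts such a function today, and the 1.6 ratio is just the Source Code Pro measurement from above:

```python
from typing import Callable


def make_width_function(
    is_second_class: Callable[[str], bool],
    relative_width: float,
) -> Callable[[str], float]:
    """Return a function measuring text in 'first language' character cells.

    Hypothetical helper: characters matching is_second_class count as
    relative_width cells each; every other character counts as 1.
    """
    def text_width(text: str) -> float:
        return sum(relative_width if is_second_class(ch) else 1.0 for ch in text)

    return text_width


def is_cjk_ideograph(ch: str) -> bool:
    # Assumption for illustration: only the CJK Unified Ideographs block
    # (U+4E00..U+9FFF) is treated as the second width class.
    return "\u4e00" <= ch <= "\u9fff"


# Ratio assumed from the font sample discussed above (16 / 10).
cjk_aware_width = make_width_function(is_cjk_ideograph, 1.6)

print(cjk_aware_width("abc"))
print(cjk_aware_width("ab\u4e2d\u6587"))
```

Because the condition and the ratio are both parameters, the same factory would serve for other scripts or fonts, which is the point of not hard-coding a CJK-specific rule.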

----------
nosy: +terry.reedy
title: Use unicodedata.east_asian_width in textwrap -> CJK support for textwrap

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue24665>
_______________________________________