Terry J. Reedy <[email protected]> added the comment:
I think that this issue should be closed, as it is based on some confusions and
errors.
Textwrap works in terms of characters. The wrap method "wraps the single
paragraph in text (a string) so every line is at most width characters long."
When the module was written, 'character' meant "printable ascii (or 'extended
ascii') character". It now means 'unicode codepoint'. Both are mentally real
abstractions but have no particular correspondence to physical length. Calling
textwrap buggy because it works in characters is wrong.
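A minimal sketch of the "width counts characters" point, assuming default TextWrapper options. The sample strings are illustrative only: a run of 25 Ascii letters and a run of 25 copies of one CJK ideograph are broken at the same character counts, whatever their displayed width.

```python
import textwrap

ascii_text = "a" * 25
cjk_text = "\u4e2d" * 25  # 25 copies of a single CJK ideograph

# width is a limit on len(line), i.e. on code points per line.
ascii_lines = textwrap.wrap(ascii_text, width=10)
cjk_lines = textwrap.wrap(cjk_text, width=10)

ascii_lengths = [len(line) for line in ascii_lines]
cjk_lengths = [len(line) for line in cjk_lines]
print(ascii_lengths, cjk_lengths)
```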
Translating 'character' to 'fixed-width character space', so that one can
measure physical length in terms of 'spaces' as a physical unit, is exact if
and only if all characters are displayed in the same width space. This is true
for fixed pitch output devices that simulate typewriters and text that only
uses fixed-width characters from a fixed pitch font. For long lines, the
translation for variable-pitch fonts may or may not be good enough for a
particular use.
As David noted, textwrap already fails for Ascii control characters. And this
does cause problems when they are used in wrapped text. They are coded with 2 or
4 characters on input and may display as 0, 1, (possibly 2), 4, or 5 characters
on output, depending on the display code and the display 'device'. As to the
latter, "print('x\ax')" displays as 'xx' in a Windows console and as something
like xx in a tkinter Text widget, except that the numbers in the box here on
Firefox are not present, so that the tk box is (sort-of) the same width as 'x'.
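The character counts for a control character can be checked directly. This is only a sketch using BEL as the example; what a given console or widget draws for it cannot be shown in code, only that textwrap counts it as one character.

```python
import textwrap

# Source-code spellings of BEL: 2 or 4 characters as typed.
short_spelling = r"\a"
hex_spelling = r"\x07"

# Both spellings denote the same single code point.
bell = "\a"
text = "x\ax"
print(len(short_spelling), len(hex_spelling), len(text))

# textwrap counts BEL as one character, whatever a display does with it.
lines = textwrap.wrap("x\ax x\ax x\ax", width=7)
line_lengths = [len(line) for line in lines]
print(line_lengths)
```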
The particular premise of this issue is that CJK characters are somehow special
and that 2.x releases, and now 3.x releases, are particularly broken for CJK
text. Not so. If one has text that only uses same-width characters in a
fixed-pitch CJK font (including wide spaces so columns line up), then textwrap
works as well as it does for any other fixed-pitch text (ie, Ascii or Latin1).
If one wants lines of a particular physical width, one passes a character width
argument that corresponds to the desired physical width.
The following is based on what I see in IDLE's Settings dialog Font page font
sample for Windows 10 'Source Code Pro'. It includes samples from 12
'alphabets'. To view it, run 'python -m idlelib', and on the top menu click
Options => Configure IDLE. When the selected font is not a full BMP unicode
font, Tk and Windows use other fonts, scaled to the same height, to synthesize
a fairly complete BMP 'font'. The [Help] text says a bit more but has a
mistake. What I see:
Font size corresponds to physical height. Hence, the lines are very close to
the same height. Some fonts look smaller or larger because they specify more
or less blank space between lines. One factor is the use of descenders, as in
Arabic.
Character width for a fixed height varies. 20 characters in Greek, IPA,
Hebrew, and Arabic take progressively less physical space than 20 Ascii or Latin1
characters. 20 characters in Devanagari, Cyrillic, and Tamil take
progressively more. (The Tamil line only has 14 chars.) None of these are
obviously fixed pitch.
The Chinese, Korean, and Japanese samples have a fixed pitch. The characters
are *not* actually 'double-wide', at least not relative to most other languages.
The 10 CJK characters are as wide as 16 Source Code Pro characters. To match
the physical width of 72 Ascii spaces, one should pass 'width=45'.
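The arithmetic behind 'width=45' can be sketched as follows. The helper name equivalent_width is hypothetical, and the 10-to-16 ratio is the one observed above for Source Code Pro; other fonts would need their own measurement.

```python
import textwrap

def equivalent_width(target_columns, n_chars, n_reference_chars):
    """Character count whose physical width matches target_columns.

    n_chars characters of the text's script are assumed to be as wide as
    n_reference_chars characters of the reference (Ascii) font.
    Integer arithmetic avoids float rounding; the result rounds down.
    """
    return target_columns * n_chars // n_reference_chars

# 10 CJK characters measured as wide as 16 Source Code Pro characters.
width = equivalent_width(72, 10, 16)

cjk_text = "\u4e2d" * 100  # illustrative pure-CJK text
lines = textwrap.wrap(cjk_text, width=width)
print(width, [len(line) for line in lines])
```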
But note that the exact ratio for Ascii depends on the font. It is a little
higher for Courier and Lucida Console. It ranges from about 1 (for Arabic) to
2 (for Tamil) for other languages. The first 10 Tamil characters are slightly
wider than the 10 CJK characters, so counting each CJK character as two average
Tamil characters is completely wrong.
My conclusion: the proposal is unnecessary for pure CJK text; it is wrong in
hard-coding a fix only for CJK; the CJK fix is wrong in hard-coding a
particular ratio, in particular, one that is at the extreme end of the range of
possibilities. Therefore, I think the open PR should be closed. I also think
this issue should be closed in favor of #12499, which proposed to allow users
to pass a transform function suitable for their particular use case. If that
is implemented, and we decide to then add some sample functions, or rather,
function factories, and to include one specifically for CJK*, then a new PR
will be needed, and a new issue would be appropriate.
* A more generic function factory for text with characters of two width classes
might have as inputs a condition to identify '2nd language characters' and
their fixed or average width relative to the 'first' language.
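One possible shape for such a factory, as a sketch only. The names make_width_function and is_cjk_ideograph are hypothetical, the ideograph test covers just the basic CJK Unified Ideographs block, and the 8/5 ratio is the Source Code Pro measurement from above. textwrap currently has no hook to accept such a function, so this only measures strings.

```python
from fractions import Fraction

def make_width_function(is_second_class, relative_width):
    """Return a function measuring text in 'first language' character units.

    is_second_class: predicate identifying '2nd language characters'.
    relative_width: their fixed or average width relative to the first.
    """
    def text_width(text):
        return sum(relative_width if is_second_class(ch) else 1 for ch in text)
    return text_width

def is_cjk_ideograph(ch):
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"

# 10 CJK characters as wide as 16 Ascii ones gives a ratio of 8/5.
cjk_width = make_width_function(is_cjk_ideograph, Fraction(8, 5))

print(cjk_width("abc"), cjk_width("\u4e2d\u6587"), cjk_width("ab\u4e2d"))
```

Passing the predicate and the ratio as arguments keeps the font-dependent number out of the library, which is the objection raised above to hard-coding a ratio of 2.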
----------
nosy: +terry.reedy
title: Use unicodedata.east_asian_width in textwrap -> CJK support for textwrap
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue24665>
_______________________________________