Tom Christiansen <tchr...@perl.com> added the comment:

> Martin v. Löwis <mar...@v.loewis.de> added the comment:

> I think the WideCharToMultibyte approach is just incorrect.

> I'm -1 on using wcswidth, though. 

Like you, I too seriously question using wcswidth() for this at all:

    The wcswidth() function either shall return 0 (if pwcs points to a
    null wide-character code), or return the number of column positions
    to be occupied by the wide-character string pointed to by pwcs, or
    return -1 (if any of the first n wide-character codes in the wide-
    character string pointed to by pwcs is not a printable wide-
    character code).

I would be willing to bet (a small amount of) money that it does not correctly
implement Unicode print widths, even though one would certainly *think* it
does according to this:

     The wcswidth() function determines the number of column positions
     required for the first n characters of pwcs, or until a null wide
     character (L'\0') is encountered.

There are a bunch of "interesting" cases I would want it tested against.

> We already have unicodedata.east_asian_width, which implements 
> http://unicode.org/reports/tr11/ 

> The outcomes of this function are these:
> - F: full-width, width 2, compatibility character for a narrow char
> - H: half-width, width 1, compatibility character for a narrow char
> - W: wide, width 2
> - Na: narrow, width 1
> - A: ambiguous; width 2 in Asian context, width 1 in non-Asian context
> - N: neutral; not used in Asian text, so has no width. Practically, width can 
> be considered as 1

Um, East_Asian_Width=Ambiguous (EA=A) isn't actually good enough for this.
And EA=N cannot be considered width 1, either.

For example, some of the Marks are EA=A and some are EA=N, yet how many
print columns they take varies.  It is usually 0, but can be 1 at the start
of the file/string or immediately after a linebreak sequence.  Then there
are things like the variation selectors, which never take up any columns.
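
To see how little East Asian Width says about marks, one can print the
properties Python's unicodedata module reports for a few of them.  This is
only a sketch: the four sample code points and the describe() helper are
picked here for illustration and are not taken from the issue.

```python
import unicodedata

# Three nonspacing marks and one variation selector, chosen only as examples.
samples = ["\u0301", "\u0483", "\u05b0", "\ufe0f"]

def describe(ch):
    """Return (code point, general category, East Asian Width, name) for ch."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.category(ch),
        unicodedata.east_asian_width(ch),
        unicodedata.name(ch, "<unnamed>"),
    )

for ch in samples:
    print(*describe(ch))
```

Whatever mix of A and N comes back, neither value tells you that these
characters normally occupy no column of their own.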

Now consider the many \pC code points, like 

    U+0009  CHARACTER TABULATION
    U+00AD  SOFT HYPHEN 
    U+200C  ZERO WIDTH NON-JOINER
    U+FEFF  ZERO WIDTH NO-BREAK SPACE
    U+2062  INVISIBLE TIMES

A TAB is its own problem but SHY we know is only width=1 immediately
before a linebreak or EOF, and ZWNJ and ZWNBSP are both certainly
width=0.  So are the INVISIBLE * code points.
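
The same check can be run on the five code points listed above.  Again a
sketch only; properties() is a hypothetical helper name, and the point is
just that the General Category has to be consulted alongside East Asian
Width.

```python
import unicodedata

# The \pC code points listed above.
code_points = [0x0009, 0x00AD, 0x200C, 0xFEFF, 0x2062]

def properties(cp):
    """Return (general category, East Asian Width) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.east_asian_width(ch)

for cp in code_points:
    category, eaw = properties(cp)
    print("U+%04X  gc=%s  ea=%s" % (cp, category, eaw))
```

TAB is a control character (Cc) and the other four are format characters
(Cf), so a width rule that looks only at East Asian Width would count each
of them as a column.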

Context:

Imagine you're trying to format a string so that it takes up exactly 20
columns: you need to know how many spaces to pad it with based on its
print width.  That is what #12568 needs to do, and it takes much more
than the East Asian Width property to do it.
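
As a rough illustration of the padding problem, here is a deliberately
simple-minded sketch.  The names approx_columns() and pad_to() are
hypothetical, and the rules are assumptions: marks and format characters
count as 0, wide and fullwidth characters as 2, everything else (including
ambiguous ones) as 1.  It ignores every context-dependent case mentioned
above, as well as TAB and grapheme clusters, so it is nowhere near a real
columns() implementation.

```python
import unicodedata

def approx_columns(text):
    """Rough column count under the assumptions stated in the lead-in."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            # Assumption: nonspacing/enclosing marks and format characters
            # never take a column.  This ignores a mark at the start of a
            # string and SOFT HYPHEN before a linebreak.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            # Assumption: ambiguous (A) is treated as 1, i.e. non-Asian context.
            width += 1
    return width

def pad_to(text, columns):
    """Right-pad text with spaces up to the requested number of columns."""
    return text + " " * max(0, columns - approx_columns(text))

# "e" followed by COMBINING ACUTE ACCENT: two code points, one column.
print(repr(pad_to("e\u0301", 20)))
```

Padding by len() would come up one space short for that last string, which
is exactly the kind of misalignment at issue.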

I really do think that what #12568 is asking for is the equivalent of the
columns() method of Perl's Unicode::GCString, and that you aren't going to
be able to handle text alignment of Unicode with anything much less than
that.  After all, #12568's title is "Add functions to get the width in
columns of a character".  I would very much like to compare what columns()
reports with what wcswidth() reports.  I bet wcswidth() is very
simple-minded at best.

I may of course be wrong.

--tom

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12568>
_______________________________________