The problem of proper display of boxed unicode data is an interesting
one. The first step to getting this fixed is for someone to provide a
working J model that takes an arbitrary boxed argument and produces the
character stream that properly displays it. If we had such a model we
might consider incorporating it into the JE.
----- Original Message -----
From: "June Kim" <[EMAIL PROTECTED]>
To: "General forum" <[email protected]>
Sent: Sunday, February 11, 2007 5:11 AM
Subject: Re: [Jgeneral] wd 'set ...' with box draw characters
2007/2/11, Chris Burke <[EMAIL PROTECTED]>:
June Kim wrote:
[snip]
> Second, the box breaks with characters of differing widths (that is,
> when the byte length of the encoding and the width of the characters
> on display don't match). What is the usual way of solving this in
> other programming languages? There is a Unicode standard for
> character widths: http://unicode.org/reports/tr11/
>
> Python implements that standard (along with others) in its
> unicodedata module.
>
> >>> unicodedata.east_asian_width(u'한')
> 'W'
> >>> unicodedata.east_asian_width(u'a')
> 'Na'
>
> (The u prefix marks the following string as Unicode. east_asian_width
> returns the width category of any Unicode character, not only East
> Asian ones; the narrow name is historical.)
>
[snip]
If you are having problems with display, it is because of the font,
not because we are not using unicode.
[snip]
When a string is boxed and contains characters whose display width
differs from their byte length, the box is broken in J. It is not
because of the font. It is because J assumes that every character's
display width equals its byte length, which is clearly false in many
writing and encoding systems, including East Asian ones. We can
definitely say that J's box display isn't internationalized yet.
For example, 54620 (as a Unicode code point) is a Korean character,
pronounced "han". Its width is "Wide" (twice as wide as Latin
letters).
   han=.4 u: 54620
   <han
+---+
|한|
+---+
   <8 u: han
+---+
|한|
+---+
Since J uses the byte length to determine a character's width, and
the byte length of han is 3 in UTF-8 ( 3-: #8 u: han ), the box's
horizontal character '-' (whose width is "Narrow") is printed three
times, and the box appears broken on the display.
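
To make the mismatch concrete, here is a rough Python 3 sketch (not J,
and not how the JE does it) that sizes a box by display columns rather
than by bytes. The helper names display_width and boxed are made up for
illustration. It assumes that only the "W" and "F" categories from
east_asian_width occupy two columns, treats everything else (including
"Ambiguous") as one column, and ignores combining characters, so take
it as a starting point only.

```python
import unicodedata

def display_width(text):
    # Assumption: Wide ('W') and Fullwidth ('F') characters take two
    # columns; every other category is counted as one column.
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

def boxed(text):
    # Hypothetical helper: draw a one-line box whose horizontal rule is
    # sized by display columns, not by byte length.
    rule = "+" + "-" * display_width(text) + "+"
    return "\n".join([rule, "|" + text + "|", rule])

han = chr(54620)  # the Korean character "han" from the example above

# UTF-8 byte length versus display columns
print(len(han.encode("utf-8")), display_width(han))  # 3 2

print(boxed(han))
```

With this counting the rule gets two '-' characters instead of three,
which lines up with the wide character in a fixed-width font that
follows the width standard.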
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm