I'm working on the code.
In the meantime, here is the code for calculating display width.
First you need to download the text file and save it locally from
http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt
===============================================
require 'regex jfiles'
t=: 1!:1 <'EastAsianWidth.txt'
point=:'^([0-9A-F]{4});(Na|N|H|A|W|F)' rxmatches t
range=:'([0-9A-F]{1,4})\.\.([0-9A-F]{1,4});(Na|N|H|A|W|F)' rxmatches t
jcreate 'unidatapoint'
(< }."1 point rxfrom t) jappend 'unidatapoint'
jcreate 'unidatarange'
(< }."1 range rxfrom t) jappend 'unidatarange'
===============================================
Now you have unidatapoint.ijf and unidatarange.ijf, ready for the
lookup code.
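As an aside for readers without J at hand, the same extraction step can be sketched in Python (the language the quoted message below uses for unicodedata). This is only an illustration under assumptions: the SAMPLE lines are a hand-written excerpt in the style of EastAsianWidth.txt, not the real file, and the names SAMPLE, LINE and parse_widths are hypothetical.

```python
import re

# Hand-written excerpt in the style of EastAsianWidth.txt (assumption, not the real file).
SAMPLE = """\
# EastAsianWidth (excerpt for illustration)
0041;Na # LATIN CAPITAL LETTER A
00A1;A # INVERTED EXCLAMATION MARK
AC00..D7A3;W # Hangul syllables
E000..F8FF;A # Private use
"""

# A code point or a first..last range, then the width class.
# Whitespace around ';' is tolerated in case the file layout differs.
LINE = re.compile(
    r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(Na|N|H|A|W|F)\b"
)

def parse_widths(text):
    """Return (points, ranges): {codepoint: cls} and [(first, last, cls)]."""
    points, ranges = {}, []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # comment or blank line
        first, last, cls = m.groups()
        if last is None:
            points[int(first, 16)] = cls
        else:
            ranges.append((int(first, 16), int(last, 16), cls))
    return points, ranges

points, ranges = parse_widths(SAMPLE)
print(points)
print(ranges)
```

The split into single points and ranges mirrors the two jfiles created above.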
===============================================
require 'jfiles'
NB. N : half
NB. Na : half
NB. H : half
NB. A : half
NB. F : full
NB. W : full
widthcode=:;: 'N Na H A F W'
pod=:>jread 'unidatapoint';0
rad=:>jread 'unidatarange';0
towc=: widthcode&i. NB. towidthcode
dfh=. 16&#. @ ('0123456789ABCDEF'&i.) NB. decimal from hex digits
po=:(dfh each {."1 pod),. <"0 towc"0 {:"1 pod
ra=:(,&.>/"1 dfh each 2&{."1 rad),. <"0 towc"0 {:"1 rad
poa=:>{."1 po
fill=: 4 : 0
'r c'=.x
r=. ({.r)+ i. >: -~/ r
({.c) r}y
)
tab=:65536$0 NB. missing is N
tab=:(> {:"1 po) poa} tab
tab=:>./ ra fill"1 tab
diswid=: [: >: [: 4&<: [: {&tab 3&u:@ucp NB. for rank 1; 1 is half, 2 is full
================================================
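For comparison, here is a rough Python sketch of the same idea: fill a 65536-entry table from points and ranges, then look each character up. It assumes the same convention as above (F and W are full, everything else is half). The points and ranges below are a few hand-written entries for illustration, and build_table and diswid are hypothetical names that merely echo the J verb.

```python
WIDE = {"F", "W"}  # full-width classes; N, Na, H and A count as half

def build_table(points, ranges, size=0x10000):
    """One byte per BMP code point: 2 for full width, 1 otherwise."""
    table = bytearray([1]) * size  # missing entries default to half width
    for cp, cls in points.items():
        table[cp] = 2 if cls in WIDE else 1
    for first, last, cls in ranges:
        w = 2 if cls in WIDE else 1
        table[first:last + 1] = bytes([w]) * (last - first + 1)
    return table

def diswid(s, table):
    """Display width of each character of s (BMP characters only)."""
    return [table[ord(ch)] for ch in s]

# A few hand-written entries (assumption), not the full Unicode data.
points = {0x41: "Na", 0xFF01: "F"}
ranges = [(0xAC00, 0xD7A3, "W")]
table = build_table(points, ranges)

# U+D55C and U+AE00 are Hangul syllables, followed by ASCII "ab!".
print(diswid("\ud55c\uae00ab!", table))  # [2, 2, 1, 1, 1]
```

Like the J table, this sketch only covers code points below 65536; anything above would need a larger table or a range search.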
To improve performance, you could save tab with jfiles and load it
directly instead of rebuilding it each time. You could also use a more
compact representation (3 bits per character, then compress the data).
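To make the 3-bit idea concrete, here is a small Python sketch that packs width codes (0 to 5, the indices into N Na H A F W) three bits apiece into one integer. It is only one possible layout, and pack3 and unpack3 are hypothetical names; compressing the packed data afterwards is left out.

```python
def pack3(codes):
    """Pack a sequence of width codes (each 0..7) into one integer, 3 bits apiece."""
    packed = 0
    for i, code in enumerate(codes):
        packed |= (code & 7) << (3 * i)
    return packed

def unpack3(packed, i):
    """Read back the 3-bit code stored at index i."""
    return (packed >> (3 * i)) & 7

codes = [1, 1, 5, 5, 0, 4, 3, 2]  # sample indices into N Na H A F W
packed = pack3(codes)
print([unpack3(packed, i) for i in range(len(codes))] == codes)  # True
print(packed.bit_length() <= 3 * len(codes))                     # True
```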
Usage Example:
diswid 'í.oê¸?ab!â"?'
2 2 1 1 1 1
(,:~ ((ucp'-') $~ +/@diswid)) ucp 'í.oê¸?ab!-' NB. properly
showing
the top line in fixed-pitch font
--------
í.oê¸?ab!-
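The same top-line trick can be sketched in Python with the unicodedata module mentioned in the quoted message below. This is a minimal model, not a full solution: it treats F and W as two columns and everything else (including ambiguous and combining characters) as one, and display_width and boxed are hypothetical helper names.

```python
import unicodedata

def display_width(s):
    """Columns needed in a fixed-pitch font; F and W count double."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1
        for ch in s
    )

def boxed(s):
    """Draw a one-line ASCII box sized by display width, not by byte count."""
    rule = "+" + "-" * display_width(s) + "+"
    return "\n".join([rule, "|" + s + "|", rule])

print(boxed("\ud55c"))            # U+D55C, the Korean syllable "han"
print(boxed("\ud55c\uae00ab!"))   # two Hangul syllables plus ASCII
```

A real model would also have to handle nested boxes and multi-line contents, which this sketch does not attempt.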
2007/2/13, Eric Iverson <[EMAIL PROTECTED]>:
> The problem of proper display of boxed unicode data is an interesting
> one. The first step to getting this fixed is for someone to provide a
> working J model that takes an arbitrary boxed argument and produces the
> character stream that properly displays it. If we had such a model we
> might consider incorporating it into the JE.
>
> ----- Original Message -----
> From: "June Kim" <[EMAIL PROTECTED]>
> To: "General forum" <[EMAIL PROTECTED]>
> Sent: Sunday, February 11, 2007 5:11 AM
> Subject: Re: [Jgeneral] wd 'set ...' with box draw characters
>
>
> > 2007/2/11, Chris Burke <[EMAIL PROTECTED]>:
> >> June Kim wrote:
> > [snip]
> >> > Second, the box is broken with different width characters (that
> >> > is, when the byte length of the encoding and the width of the
> >> > characters on display don't match). What is the usual way of
> >> > solving it in other programming languages? There is a unicode
> >> > standard for character widths. http://unicode.org/reports/tr11/
> >> >
> >> > Python implements that standard (along with others) in its
> >> > unicodedata module.
> >> >
> >> > >>> unicodedata.east_asian_width(u'한')
> >> > 'W'
> >> > >>> unicodedata.east_asian_width(u'a')
> >> > 'Na'
> >> >
> >> > (u specifies that the following string is unicode.
> >> > east_asian_width returns the width of the character, not only
> >> > for east asian characters but for all unicode characters; it has
> >> > a narrow name due to its history)
> >> >
> > [snip]
> >>
> >> If you are having problems with display, it is because of the font,
> >> not because we are not using unicode.
> > [snip]
> >
> > When a string is boxed and the string includes characters whose
> > display width differs from their byte length, the box is broken in
> > J. It is not because of the font. It is because J assumes that every
> > character's width is the same as its byte length, which is obviously
> > false in many writing and encoding systems, including East Asian
> > ones. We can definitely say J's box display isn't internationalized
> > yet.
> >
> > For example, 54620 (as a unicode code point) is a Korean character,
> > which is pronounced "han". Its width is "Wide" (twice as wide as
> > Latin letters).
> >
> > han=.4 u: 54620
> > <han
> > +---+
> > |한|
> > +---+
> > <8 u: han
> > +---+
> > |한|
> > +---+
> >
> > Since J counts the byte length to determine a character's width,
> > and the byte length of han is 3 in UTF-8 ( 3-: #8 u: han ), the
> > box's horizontal character '-' (whose width is "Narrow") is printed
> > three times, and on the display the box is broken.
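
The mismatch described in the quoted text can be checked quickly in Python. This is just a sanity check of the numbers, assuming Python 3 and its bundled unicodedata tables; the variable name han follows the J session above.

```python
import unicodedata

han = chr(54620)                          # U+D55C, the character discussed above
print(len(han))                           # 1 code point
print(len(han.encode("utf-8")))           # 3 bytes in UTF-8
print(unicodedata.east_asian_width(han))  # W, i.e. two columns on display
```

So a box sized by byte count draws three dashes where two columns are needed.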