Brian Schott wrote:
> Bill,
> 
>       Look below for my questions about your reply,
> please. Thank you for your reply.
> 
> On Sun, 27 Aug 2006, bill lam wrote:
> +
> + Brian, I don't understand either.
> +
> + But these definitions are different although they look the same,
> +
> + This defines unicode (ucs2) 3!:0=131072
> +     neg =. 4&u:111 800 65 800 66 800 67 800 88 800 89 800 90 800
> +     null=. 4&u:111 805 65 805 66 805 67 805 88 805 89 805 90 805
> +     pos =. 4&u:111 799 65 799 66 799 67 799 88 799 89 799 90 799
> +
> + but this is a byte array, the utf8 encoding that represents unicode, 3!:0=2
> +   neg=.'o̠A̠B̠C̠X̠Y̠Z̠'
> +   null=.'o̥ḀB̥C̥X̥Y̥Z̥'
> +   pos=:'o̟A̟B̟C̟X̟Y̟Z̟'
> 
>       The previous 3 lines, especially the last 2 lines,
> look like junk on my ascii-only email reader, as you can see
> here. When I read the same three lines in FireFox (a
> non-ascii-only email reader) the lines look better, but the
> second line looks almost identical to the first except that
> the underline characters in the first line are replaced with
> single-character squares in the second line. The third line
> looks good in FireFox, like the third line does in the .ijx
> window with the font suggested by Istvan Kadar in his post.
> 

I do not have the MS Arial Unicode font installed, so I use Lucida Console or
Lucida Unicode instead. Please refer to this image
http://www.jsoftware.com/jwiki/BillLam/temp/

This line defines 14 unicode characters (codepoints)
neg =. u: 111 800 65 800 66 800 67 800 88 800 89 800 90 800

You can count: o _ A _ B _ .... , there are 14 characters displayed. The underbar
symbol _ is displayed below the baseline, but it is still one character. Notice
that the rendering here is not perfect, because the underbar symbol should sit
exactly under the preceding character, similar to an overstruck APL symbol. Thus
2 unicode characters (codepoints) represent 1 glyph. (Sometimes the unicode
standard also defines 1 codepoint for a precomposed glyph.) Anyway, there are 14
(not 21) unicode characters inside "neg" in this example, as confirmed by
  3&u: neg
111 800 65 800 66 800 67 800 88 800 89 800 90 800
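
As a cross-check outside J, the same count can be reproduced with a small Python
sketch. This is only an illustration, not part of the J session; the names
`codepoints` and `neg` are made up here. A Python str is a sequence of
codepoints, so it plays roughly the role of the result of u: above.

```python
# The same 14 codepoints used in the J definition of neg.
codepoints = [111, 800, 65, 800, 66, 800, 67, 800, 88, 800, 89, 800, 90, 800]

# chr() maps an integer codepoint to a one-character string.
neg = "".join(chr(c) for c in codepoints)

# len() of a str counts codepoints, not glyphs and not bytes.
print(len(neg))                               # 14
print([ord(ch) for ch in neg] == codepoints)  # True
```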

But if you define neg by typing or cut-and-paste into ijx/ijs, you are *not*
working with unicode codepoints anymore (assuming you can see the symbols here).
 neg1=. 'o̠A̠B̠C̠X̠Y̠Z̠'

   3!:0 neg1
2
   a.i. neg1
111 204 160 65 204 160 66 204 160 67 204 160 88 204 160 89 204 160 90 204 160
   $neg1
21

neg1 itself is not unicode, it is a byte array that encodes unicode. ijx/ijs
assumes that everything you type is utf8 representing unicode and does the
translation automatically, so that you can see the unicode symbols.
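
The relation between the 21 bytes and the 14 codepoints can also be checked by
decoding the bytes by hand. Again a Python sketch for illustration only; `raw`
and `text` are names invented for this example, and the byte values are copied
from the a.i. neg1 output above.

```python
# The 21 byte values shown by  a.i. neg1
raw = bytes([111, 204, 160, 65, 204, 160, 66, 204, 160, 67, 204, 160,
             88, 204, 160, 89, 204, 160, 90, 204, 160])

# Decoding the utf8 bytes gives back the codepoints they encode.
text = raw.decode("utf-8")

print(len(raw))                   # 21 bytes
print(len(text))                  # 14 codepoints
print([ord(ch) for ch in text])   # 111 800 65 800 ... 90 800
```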

correspondence between unicode and utf8
   (<@(3&u:)"0 neg),:(a.&i.@utf8&.>)neg
+---+-------+--+-------+--+-------+--+-------+--+-------+--+-------+--+-------+
|111|800    |65|800    |66|800    |67|800    |88|800    |89|800    |90|800    |
+---+-------+--+-------+--+-------+--+-------+--+-------+--+-------+--+-------+
|111|204 160|65|204 160|66|204 160|67|204 160|88|204 160|89|204 160|90|204 160|
+---+-------+--+-------+--+-------+--+-------+--+-------+--+-------+--+-------+

You see that every codepoint above 127 in neg is encoded using 2 bytes in utf8.
In general a unicode codepoint may be represented by 1, 2, 3 or 4 bytes in
utf8 encoding. Most han characters are represented by 3 bytes of utf8.
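
The 1, 2, 3 or 4 byte cases can be seen with one sample codepoint from each
range. This is a Python sketch with a helper name (`utf8_len`) invented for the
example. The han character 0x4E2D and the 4-byte codepoint 0x1F600 are my own
picks, not taken from the thread, and the 4-byte one lies outside what ucs2 can
hold.

```python
def utf8_len(codepoint):
    """Number of bytes utf8 needs for a single codepoint."""
    return len(chr(codepoint).encode("utf-8"))

# One sample from each length class:
#   111     'o'                      -> 1 byte
#   800     combining sign from neg  -> 2 bytes
#   0x4E2D  a han character          -> 3 bytes
#   0x1F600 beyond 16 bits           -> 4 bytes
for cp in (111, 800, 0x4E2D, 0x1F600):
    print(cp, utf8_len(cp), list(chr(cp).encode("utf-8")))
```

The boundaries fall at 128, 2048 and 65536, which is why 800 needs exactly 2
bytes.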

ucp is a cover verb for 7&u:, and similarly utf8 is a cover verb for 8&u:. They
should be defined in J stdlib.ijs,

> +
> + to convert to ucs2, use ucp
> +   neg=.ucp 'o̠A̠B̠C̠X̠Y̠Z̠'
> +   null=.ucp 'o̥ḀB̥C̥X̥Y̥Z̥'
> +   pos=:ucp 'o̟A̟B̟C̟X̟Y̟Z̟'
> 
>       Where is "ucp" found and using the keyboard how does
> one produce the character strings in single quotes in the
> previous 3 lines?  I can only produce those strings with the
> 4&u: verb, not directly with the keyboard.
> 

J does not have a built-in IME, so I guess it depends on your OS IME. There is a
Chinese IME that allows entering unicode directly by typing its hexadecimal
value, but I seldom use it. I don't know how to do it on a Mac.

> +
> + do not trust what you saw in ijx, use 3&u: instead to show the true data
> + (similar to using a.&i. to display ascii)
> 
>       To confirm your admonition to use 3&u: I produced
> the following three experiments. It appears that you are
> correct and that 7 2&$ is preferable to 7 3&$ .
> 
>    3 u: neg
> 111 800 65 800 66 800 67 800 88 800 89 800 90 800
>    3 u: 7 3$neg
> 111 800  65
> 800  66 800
>  67 800  88
> 800  89 800
>  90 800 111
> 800  65 800
>  66 800  67
>    3 u: 7 2$neg
> 111 800
>  65 800
>  66 800
>  67 800
>  88 800
>  89 800
>  90 800
> 

I'm not sure whether the composite characters (2 characters for 1 glyph) were
specifically chosen to illustrate some idea or not. I have no experience in this
area, as in Chinese/Japanese 1 unicode codepoint = 1 han glyph.
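
For what it is worth, the "2 codepoints for 1 glyph" idea and the precomposed
case can both be seen with Python's standard unicodedata module. This is a
sketch for illustration only, and the variable names are invented here. It
assumes that counting the codepoints that are not combining marks is a good
enough stand-in for counting glyphs, which holds for this simple example but
not in general.

```python
import unicodedata

codepoints = [111, 800, 65, 800, 66, 800, 67, 800, 88, 800, 89, 800, 90, 800]
neg = "".join(chr(c) for c in codepoints)

# combining() is 0 for a base character and nonzero for a combining mark,
# so counting the zeros counts the base letters: 7 glyphs from 14 codepoints.
glyphs = sum(1 for ch in neg if unicodedata.combining(ch) == 0)
print(len(neg), glyphs)                  # 14 7

# A followed by codepoint 805 (combining ring below) has a precomposed form.
decomposed = "A" + chr(805)
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))    # 2 1
print(ord(composed))                     # 7680
```

This matches the 7 2$ reshape above: each row is one base letter plus its
combining mark.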

PS. I'm not sure if terms like "unicode codepoint", "unicode character" or
"glyph" are used correctly. They may actually mean the opposite, so you had
better check them yourself. :-)

-- 
regards,
bill
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
