Brian Schott wrote:
> Bill,
>
> Look below for my questions about your reply,
> please. Thank you for your reply.
>
> On Sun, 27 Aug 2006, bill lam wrote:
> +
> + Brian, I don't understand either.
> +
> + But these definitions are different although they look the same.
> +
> + This defines unicode (ucs2), 3!:0=131072
> +    neg =. 4&u:111 800 65 800 66 800 67 800 88 800 89 800 90 800
> +    null=. 4&u:111 805 65 805 66 805 67 805 88 805 89 805 90 805
> +    pos =. 4&u:111 799 65 799 66 799 67 799 88 799 89 799 90 799
> +
> + but this is a one-byte utf8 encoding that represents unicode, 3!:0=2
> +    neg=.'o̠A̠B̠C̠X̠Y̠Z̠'
> +    null=.'o̥ḀB̥C̥X̥Y̥Z̥'
> +    pos=:'o̟A̟B̟C̟X̟Y̟Z̟'
>
> The previous 3 lines, especially the last 2 lines,
> look like junk on my ascii-only email reader, as you can see
> here. When I read the same three lines in FireFox (a
> non-ascii-only email reader) the lines look better, but the
> second line looks almost identical to the first except that
> the underline characters in the first line are replaced with
> single-character squares in the second line. The third line
> looks good in FireFox, like the third line does in the .ijx
> window with the font suggested by Istvan Kadar in his post.
>
I do not have the MS Arial Unicode font installed, so I use Lucida Console or Lucida Unicode instead. Please refer to this image:
http://www.jsoftware.com/jwiki/BillLam/temp/

This line defines 14 unicode characters (codepoints):

   neg =. u: 111 800 65 800 66 800 67 800 88 800 89 800 90 800

You can count them: o _ A _ B _ ... There are 14 characters displayed. The underbar symbol _ is displayed below the baseline, but it is still one character. Notice that the rendering here is not perfect, because the underbar should sit exactly under the preceding character, similar to an overstruck APL symbol. Thus 2 unicode characters (codepoints) represent 1 glyph. (Sometimes the unicode standard defines 1 codepoint for a precomposed glyph.) Anyway, there are 14 (not 21) unicode characters inside "neg" in this example, as confirmed by

   3&u: neg
111 800 65 800 66 800 67 800 88 800 89 800 90 800

But if you define neg by typing or by cut-and-paste into ijx/ijs, you are *not* working with unicode codepoints anymore (assuming you can see the symbols here):

   neg1=. 'o̠A̠B̠C̠X̠Y̠Z̠'
   3!:0 neg1
2
   a.i. neg1
111 204 160 65 204 160 66 204 160 67 204 160 88 204 160 89 204 160 90 204 160
   $neg1
21

neg1 itself is not unicode; it is a byte array that encodes unicode. ijx/ijs assume that you type everything using utf8 to represent unicode and do the translation automatically, so that you can see the unicode symbols.

Correspondence between unicode and utf8:

   (<@(3&u:)"0 neg),:(a.&[EMAIL PROTECTED]&.>)neg
+---+-------+--+-------+--+-------+--+-------+--+-------+--+-------+--+-------+
|111|800    |65|800    |66|800    |67|800    |88|800    |89|800    |90|800    |
+---+-------+--+-------+--+-------+--+-------+--+-------+--+-------+--+-------+
|111|204 160|65|204 160|66|204 160|67|204 160|88|204 160|89|204 160|90|204 160|
+---+-------+--+-------+--+-------+--+-------+--+-------+--+-------+--+-------+

You can see that every codepoint above 127 in neg is encoded using 2 characters (bytes) in utf8.
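
The same correspondence can be checked on a shorter string. This is only a minimal sketch: it assumes the standard library is loaded (for echo), and that 4 u: (integers to unicode), 8 u: (unicode to utf8) and 7 u: (utf8 to unicode) behave as the J vocabulary describes. The names neg2 and bytes are made up for this illustration and do not appear elsewhere in the thread.

```j
NB. neg2 and bytes are hypothetical names, used only for this sketch
neg2  =: 4 u: 111 800 65 800   NB. 4 unicode codepoints: o, U+0320, A, U+0320
echo 3!:0 neg2                 NB. 131072 (unicode)
echo # neg2                    NB. 4
bytes =: 8 u: neg2             NB. the same text as a utf8 byte array
echo 3!:0 bytes                NB. 2 (literal, i.e. bytes)
echo a. i. bytes               NB. 111 204 160 65 204 160
echo 3 u: 7 u: bytes           NB. 111 800 65 800 (round trip back to codepoints)
```

Counting with # before and after 8 u: shows the difference directly: 4 codepoints become 6 bytes, because each 800 turns into the pair 204 160.
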
In general a unicode codepoint may be represented by 1, 2, 3 or 4 bytes in utf8 encoding. Most han characters are represented by 3 bytes of utf8.

ucp is a cover verb for 7&u:, and similarly utf8 is a cover verb for 8&u:. They should be defined in J stdlib.ijs.

> +
> + to convert to ucs2, use ucp
> +    neg=.ucp 'o̠A̠B̠C̠X̠Y̠Z̠'
> +    null=.ucp 'o̥ḀB̥C̥X̥Y̥Z̥'
> +    pos=:ucp 'o̟A̟B̟C̟X̟Y̟Z̟'
>
> Where is "ucp" found and using the keyboard how does
> one produce the character strings in single quotes in the
> previous 3 lines? I can only produce those strings with the
> 4&u: verb, not directly with the keyboard.
>

J does not have a built-in IME, so I guess it depends on your OS IME. There is a Chinese IME that allows entering unicode directly by typing its hexadecimal value, but I seldom use it. I don't know how to do it on Mac.

> +
> + do not trust what you saw in ijx, use 3&u: instead to show the true data
> + (similar to using a.&i. to display ascii)
>
> To confirm your admonition to use 3&u: I produced
> the following three experiments. It appears that you are
> correct and that 7 2&$ is preferable to 7 3&$ .
>
>    3 u: neg
> 111 800 65 800 66 800 67 800 88 800 89 800 90 800
>    3 u: 7 3$neg
> 111 800  65
> 800  66 800
>  67 800  88
> 800  89 800
>  90 800 111
> 800  65 800
>  66 800  67
>    3 u: 7 2$neg
> 111 800
>  65 800
>  66 800
>  67 800
>  88 800
>  89 800
>  90 800
>

I'm not sure whether the composite characters (2 characters for 1 glyph) were specifically chosen to illustrate some idea or not. I have no experience in this area, because in Chinese/Japanese 1 unicode codepoint = 1 han glyph.

PS. I'm not sure if terms like "unicode codepoint", "unicode character" or "glyph" are used correctly. They may actually mean the opposite, so you had better check them yourself. :-)

-- 
regards,
bill
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
