Yes, Bill. It simply treats each UTF-8 byte as a letter, which keeps the
bytes of a UTF-8 character code together. Normally ;: treats each byte of
a multi-byte UTF-8 character as a special character like + or -. Strange
things then happen: sometimes an invalid character is displayed, and
sometimes each byte is shown as whatever character it stood for before
Unicode existed. That is not very desirable. Inside a comment or literal
the bytes of a UTF-8 character are kept together, giving the expected
result. This change simply makes the UTF-8 bytes work the same way outside
of comments and literals.
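
To make the idea concrete, here is a rough sketch in Python rather than J.
It is a toy word splitter, not the real ;: state machine: the function
split_words and its flag high_bytes_are_letters are made up for this
illustration. The only point is that classing every byte of 0x80 or above
as a letter keeps a multi-byte character in one word.

```python
def split_words(data: bytes, high_bytes_are_letters: bool) -> list:
    """Toy splitter (hypothetical, NOT J's ;:): runs of letters form a word."""

    def is_letter(b: int) -> bool:
        if b < 0x80:
            # ASCII letters, digits and underscore count as name characters.
            return chr(b).isalnum() or b == ord("_")
        # Bytes >= 0x80 are the lead and continuation bytes of UTF-8.
        return high_bytes_are_letters

    words, current = [], bytearray()
    for b in data:
        if is_letter(b):
            current.append(b)
            continue
        if current:
            words.append(bytes(current))
            current = bytearray()
        if b not in b" \t\r\n":
            # Any other byte becomes a one-byte word, like + or -.
            words.append(bytes([b]))
    if current:
        words.append(bytes(current))
    return words


text = "\u03b9+1".encode("utf-8")      # Greek iota, then +1: b'\xce\xb9+1'
split_apart = split_words(text, False)  # iota's two bytes become two words
kept_whole = split_words(text, True)    # iota's two bytes stay in one word
```

With the flag off, the two bytes of iota come out as two separate one-byte
words; with it on, they stay together as a single word.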

As I stated earlier, I don't know where J intends to go with Unicode. This
could reopen the discussion of supporting APL characters as primitives.
I hope not. Personally I like the strictly ASCII definitions of
primitives. But I noticed that the Android version of J supports UTF-8
names, like setting iota to i. and using it as a named verb. I'm not sure
what to think of that. Unicode code points are a mixture of characters
from many languages, tokens, and symbols. I don't see any order that
distinguishes between them.

One of the things that the support of line feeds in ;: provides is the
ability to process all kinds of non-J data, which might include UTF-8 that
is not part of a literal or J comment. Adding this support for UTF-8 might
make ;: more useful for such data.

Unicode and the UTF-x encodings are awkward to deal with in J, because a
UTF-x encoded character may take more than one item and J does not treat
multiple items as a single character. When I deal with text that may
include UTF-8, such as text from the internet, I immediately convert it to
UTF-16 or UTF-32, hoping to avoid multiple items representing a character.
To me that is much easier than having to code around multiple items
representing a character.
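
As a small illustration of the item counts involved, again in Python
rather than J, the sketch below counts the code units each encoding needs
for a string. The helper name unit_counts is made up for this example. It
also shows why I say "hoping": a character outside the Basic Multilingual
Plane still takes two units in UTF-16, and only UTF-32 gives one unit per
code point.

```python
def unit_counts(text: str) -> dict:
    """Count the code units needed to hold text in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 1-byte units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 2-byte units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 4-byte units
    }


accented = unit_counts("\u00e9")    # e with acute accent, inside the BMP
emoji = unit_counts("\U0001F600")   # an emoji, outside the BMP
```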

This seemed to me a simple way to get ;: to handle UTF-8 as one might
expect.

Just a thought.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
