Chris wrote:

> I don't think it is right to say that unicode is a second class citizen
> in J.
You're right; "second class citizen" connotes criticism, and that wasn't my intent.

> just use ucpcount to get character count if
> desired, or convert to the alternative 2-byte
> unicode datatype.

However, the very fact that I have to use different (non-primitive) methods to obtain character counts for Unicode definitely places it in a different category from ASCII. If I want the number of characters in an ASCII string, I just use # , like for any other array in J.

As a concrete example, look at my comments at http://www.rosettacode.org/wiki/Talk:Character_code#J_solution . Forget the ^:(_1^isLiteral) parts. Just consider the main verbs that convert characters into their corresponding character codes (integers). For ASCII, the conversion is simple, elegant, and exemplifies established J patterns and ways of thinking:

   a.&i.  NB. Index of input in list of all ASCII characters

For Unicode, the waters are murkier:

   3 u: 7 u: ]  NB. ?

Not only is this longer, but it makes use of two arbitrary function codes which must be looked up. It also took me a long time (relative to a.&i. ) to find a solution that worked for all reasonable input, and yet I'm still not confident I've covered everything.

> What exactly are you struggling with? Can you provide a simple example?

I've already solved it; it just took me much longer than it would have, had the data been in ASCII. Don't get me wrong: the difference is mainly attributable to my ignorance of Unicode (and character-encoding issues in general), not to lack of support in J. However, if J (scripts or primitives) encapsulated all the Unicode knowledge I lack, that would have sped me up significantly.

In particular, one major tool I could use is a verb that would turn any reasonable input into an array of characters (characters in the sense that # et al. work). Let's call this verb 'chars'.
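To make the contrast concrete, here is roughly the kind of session I have in mind, along with a first stab at chars. This is a hedged sketch: 'naïve' assumes the source text is UTF-8 (so the literal holds 6 bytes for 5 characters), and chars is my hypothetical name, not an existing library verb.

```j
   a. i. 'abc'          NB. ASCII: simple, and # already counts characters
97 98 99
   # 'naïve'            NB. UTF-8 literal: # counts bytes, not characters
6
   3 u: 7 u: 'naïve'    NB. Unicode code points, via two function codes
110 97 239 118 101

   NB. Hypothetical sketch of chars: normalize literal (ASCII or UTF-8)
   NB. input to the unicode datatype, so that # et al. count characters.
   chars =: 7&u:
   # chars 'naïve'
5
```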
I think that means the output of chars is the Unicode datatype in J, and the input is ASCII, UTF-8, or any other encoding (possibly as determined by a BOM or lack of one). The key to chars is that I should not have to specify, know, or even think about the format of that input.

Inversely, I'd like to be able to take an array of the Unicode datatype and convert it to any reasonable encoding; let's call this verb 'charConvert'. I'd also need an auxiliary verb (say, 'charType') which would determine the encoding of its input.

With these 3 verbs, I could read in literal data in any encoding (without knowing or caring what that is), manipulate and transform it like any other J array, and then convert it back to its original encoding, appropriate for consumption by the original producer. All this while maintaining my blissful ignorance of Unicode. For example:

   charType myData  NB. result could be ASCII, utf8, utf16, utf32, whatever
utf8
   datatype chars myData  NB. datatype@chars is always 'unicode'
unicode
   input -: (charType myData) charConvert chars myData  NB. Tautology
1
   char =: 1 : 'charType charConvert u@:chars'  NB. Utility adverb means I never have to care about encodings again

Possibly these verbs or their equivalents already exist in the J standard library or an addon. The problem with that is I don't know how to find them, short of find-in-files and searching the Wiki, which is time-consuming (and occasionally fruitless, if I use the wrong keywords to seek what I want). On the other hand, I do know u: has something to do with Unicode. So usually I just type that, hit Ctrl-F1, and use that documentation to construct a once-off solution. So maybe the trouble is that I should spend my time searching for existing solutions rather than trying to recreate them from low-level primitives.

-Dan

PS: Regarding the definition of u: . If I understand it correctly, one thing I like is the parsimony of 8&u: .
That verb only produces Unicode if it's required (there is a character code > 127). Otherwise it produces ASCII. Perhaps we should emulate (or leverage) that behavior in the definition of chars .

PPS: Complaining of my recent troubles to a colleague, he pointed me at http://www.rgagnon.com/gp/gp-0004.html which observes:

   Modern OS (eg. W2K, XP) are Unicode based, meaning that two bytes are
   used for each character. However, certain text utilities may not
   understand Unicode, so if you want to create an ASCII version, type the
   following command from a DOS shell:

      C:\> type [unicode file name] > [text file name]

   The type command performs the conversion. The Unicode format is detected
   from a special two-byte signature at the beginning of the file (0xFF,
   0xFE). The conversion includes the mapping of accented characters too.

which will definitely come in handy (at least for me) next time I'm in a Unicode-related bind.

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
