Dan Bron wrote:
> However, if J (scripts or primitives) encapsulated all the Unicode
> knowledge I lack, that would've significantly sped me up. In particular,
> one major tool I could use is a verb that would turn any reasonable input
> into an array of characters (characters in the sense that # et al work).
> Let's call this verb 'chars'.
>
> I think that means the output of chars is the Unicode datatype in J, and
> the input is ASCII, utf8, or any other encoding (possibly as determined by
> a BOM or lack of one). The key to chars is that I should not have to
> specify, know, or even think about the format of that input.

A good start might be the unicode script:

   load 'unicode'

Your chars verb might then be:

   chars=: ucp @ toutf8

For example:

   txt=. 'hello 香港'
   #txt
12
   #chars txt
8

I understand that this is awkward compared to what you are used to. However, we all face the problem that early systems mapped bytes and characters one-to-one, and this can no longer be maintained. Utf8 is a natural solution for J, as elsewhere. True, there are slight inconveniences with using utf8, but there would be a lot more with any other representation.

> PS: Regarding the definition of u: . If I understand it correctly, one
> thing I like is the parsimony of 8&u: . That verb only produces Unicode
> if it's required (there is a character code > 127). Otherwise it produces
> ASCII. Perhaps we should emulate (or leverage) that behavior the
> definition of chars .

Note the stdlib definitions:

   ucp=: 7&u:
   utf8=: 8&u:

> PPS: Complaining of my recent troubles to a colleague, he pointed me at
> http://www.rgagnon.com/gp/gp-0004.html which observes:
>
>     Modern OS (eg.W2K, XP) are Unicode based, meaning that two bytes
>     are used for each character.

This is true only for Windows. Linux distributions moved to utf8 some years ago, and it was the success of Linux in handling utf8 that was the deciding factor for us.

One fairly significant difference now between Windows and Linux is that in Linux, a text file is assumed to be utf8, so no BOM (byte order mark) is needed, whereas in Windows the BOM is needed, and this complicates file handling.
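
To make the 7&u: and 8&u: definitions a little more concrete, here is a small session sketch. It assumes, as in the txt example, that a string literal typed into the session is held as utf8 bytes, and it uses 3!:0 (the type-code foreign, not mentioned earlier in this thread) to show which datatype comes back: 2 is literal, 131072 is unicode.

```
   txt=. 'hello 香港'

   NB. 7&u: yields unicode only when a non-ASCII character is present
   3!:0 (7 u: txt)
131072
   3!:0 (7 u: 'hello')
2

   NB. code points of the 8 characters
   3 u: 7 u: txt
104 101 108 108 111 32 39321 28207

   NB. 8&u: converts back to the original utf8 bytes
   txt -: 8 u: 7 u: txt
1
```

If this matches what you see, then the "only produce Unicode when required" behaviour appears to belong to 7&u: (ucp), while 8&u: (utf8) goes the other way and always returns bytes.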
