Dan Bron wrote:
> However, if J (scripts or primitives) encapsulated all the Unicode
> knowledge I lack, that would've significantly sped me up.  In particular,
> one major tool I could use is a verb that would turn any reasonable input
> into an array of characters (characters in the sense that  #  et al work).
>  Let's call this verb 'chars'.
> 
> I think that means the output of  chars  is the Unicode datatype in J, and
> the input is ASCII, utf8, or any other encoding (possibly as determined by
> a BOM or lack of one).  The key to  chars  is that I should not have to
> specify, know, or even think about the format of that input.  

A good start might be the unicode script:

  load 'unicode'

Your chars verb might then be:

  chars=: ucp @ toutf8

For example:

   txt=. 'hello 香港'
   #txt
12
   #chars txt
8

I understand that this is awkward compared to what you are used to.
However, we all face the problem that early systems mapped bytes and
characters one-to-one, but this can no longer be maintained. Utf8 is a
natural solution for J, as elsewhere. True, there are slight
inconveniences with using utf8, but there would be a lot more with any
other representation.

> PS:  Regarding the definition of  u:  .  If I understand it correctly, one
> thing I like is the parsimony of  8&u:  .  That verb only produces Unicode
> if it's required (there is a character code > 127).  Otherwise it produces
> ASCII.  Perhaps we should emulate (or leverage) that behavior in the
> definition of  chars  .  

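Note the stdlib definitions:

  ucp=: 7&u:
  utf8=: 8&u:

The behavior you describe belongs to 7&u: (ucp), which leaves pure
ASCII input as bytes and produces unicode only when it is needed.
8&u: (utf8) goes the other way and converts to utf8.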
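
A short session sketch of the difference, assuming the stdlib verb
datatype is available as in a standard session, and that the text is
entered as utf8:

   datatype 7 u: 'hello'          NB. all ASCII, so it stays as bytes
literal
   datatype 7 u: 'hello 香港'     NB. non-ASCII present, so unicode
unicode
   # 8 u: 7 u: 'hello 香港'       NB. converted back to utf8 bytes
12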

> 
> PPS:  Complaining of my recent troubles to a colleague, he pointed me at 
> http://www.rgagnon.com/gp/gp-0004.html  which observes:
> 
>          Modern OS (eg.W2K, XP) are Unicode based, meaning that two bytes
>          are used for each character. 

This is true only for Windows. Linux distributions moved to utf8 some
years ago, and it was the success of Linux in handling utf8 that was the
deciding factor for us. One fairly significant difference now between
Windows and Linux is that in Linux a text file is assumed to be utf8,
so no BOM (byte order mark) is needed, whereas in Windows the BOM is
needed, and this complicates file handling.
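
If a file may or may not carry the utf8 BOM, a chars-like verb could
drop it before converting. This is only a minimal sketch: the names
BOM8, dropbom and charsbom are made up for illustration and are not
part of the unicode script, and it handles only the utf8 signature
(bytes EF BB BF), not utf16 input:

   NB. BOM8: the utf8 signature as three bytes (hypothetical name)
   BOM8=: 239 187 191 { a.
   NB. dropbom: remove a leading utf8 BOM if there is one
   dropbom=: 3 : 'if. BOM8 -: 3 {. y do. 3 }. y else. y end.'
   NB. charsbom: utf8 bytes, with or without BOM, to characters
   charsbom=: 7&u: @ dropbom

For example:

   # charsbom BOM8 , 'hello 香港'
8
   # charsbom 'hello 香港'
8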

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
