Chris wrote:
>  I don't think it is right to say that unicode is a second class citizen 
>  in J.

You're right; "second class citizen" connotes criticism, and that wasn't my
intent.

>  just use  ucpcount  to get character count if 
>  desired, or convert to the alternative 2-byte 
>  unicode datatype.

However, the very fact that I have to use different (non-primitive) methods
to obtain character counts for Unicode definitely places it in a different
category from ASCII.  If I want the number of characters in an ASCII
string I just use  #  , like for any other array in J.

As a concrete example, look at my comments at 
http://www.rosettacode.org/wiki/Talk:Character_code#J_solution  .  Forget
the  ^:(_1^isLiteral)  parts.  Just consider the main verbs that convert
character codes (integers) into their corresponding characters.  For
ASCII, the conversion is simple, elegant, and exemplifies established J
patterns and ways of thinking:

      a.&i.           NB.  Index of input in list of all ASCII characters

For Unicode, the waters are murkier:

      3 u: 7 u: ]     NB.  ?

Not only is this longer, but it makes use of two arbitrary function codes
which must be looked up.  It also took me a long time (relative to  a.&i. 
) to find a solution that worked for all reasonable input, and yet I'm
still not confident I've covered everything.
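To make the contrast concrete, here's a short interactive sketch (I'm
assuming the literal 'héllo' arrives as UTF-8 bytes, as it would from a
UTF-8 source file):

      a. i. 'abc'          NB.  ASCII: one byte per character
   97 98 99
      # 'héllo'            NB.  #  counts bytes, so the 2-byte e-acute inflates the count
   6
      3 u: 7 u: 'héllo'    NB.  7 u: decodes the UTF-8; 3 u: yields code points
   104 233 108 108 111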

>  What exactly are you struggling with? Can you provide a simple example?

I've already solved it; it just took me much longer than it would have, had
the data been in ASCII.  Don't get me wrong, the difference is mainly
attributable to my ignorance of Unicode (and character encoding issues in
general), not lack of support in J.

However, if J (scripts or primitives) encapsulated all the Unicode
knowledge I lack, that would've significantly sped me up.  In particular,
one major tool I could use is a verb that would turn any reasonable input
into an array of characters (characters in the sense that  #  et al work).
 Let's call this verb 'chars'.

I think that means the output of  chars  is the Unicode datatype in J, and
the input is ASCII, utf8, or any other encoding (possibly as determined by
a BOM or lack of one).  The key to  chars  is that I should not have to
specify, know, or even think about the format of that input.  
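Assuming UTF-8 (or plain ASCII) input, and deferring BOM detection, a
first sketch of  chars  could be as thin as this (untested, name
hypothetical):

      chars =: 7&u:        NB.  decode UTF-8/ASCII bytes into J's 2-byte unicode type

      # chars 'héllo'      NB.  now  #  counts characters, not bytes
   5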

Inversely, I'd like to be able to take an array of the Unicode datatype and
convert it to any reasonable encoding; let's call this verb 'charConvert'.
 I'd also need an auxiliary verb (say, 'charType') which would determine
the encoding of its input.  

With these 3 verbs, I could read in literal data in any encoding (without
knowing or caring what that is), manipulate and transform it like any
other J array, and then convert it back to its original encoding,
appropriate for consumption by the original producer.   All this while
maintaining my blissful ignorance of Unicode.
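To be concrete, here is a crude, untested sketch of the other two verbs
(all names hypothetical; BOM handling and byte-level UTF-16/UTF-32
serialization are glossed over):

      NB.  charType: guess the encoding of y (very crude)
      charType =: 3 : 0
        if. 'unicode' -: datatype y do. 'utf16' return. end.
        if. */ y e. 128 {. a. do. 'ascii' else. 'utf8' end.
      )

      NB.  charConvert: x is the target encoding, y is unicode data
      charConvert =: 4 : 0
        if. (<x) e. ;: 'ascii utf8' do. 8 u: y return. end.
        y   NB.  placeholder: real UTF-16/UTF-32 output needs more thought
      )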

For example:

           charType myData         NB.  result could be ASCII, utf8, utf16, utf32, whatever
        utf8
           
           datatype chars myData   NB.  datatype@chars is always 'unicode'
        unicode
        
           input -: (charType myData) charConvert chars myData  NB.  Tautology
        1
           
           char  =:  1 : 'charType charConvert u@:chars'   NB.  Utility adverb means I never have to care about encodings again

Possibly these verbs or their equivalents already exist in the J standard
library or an addon.  The problem with that is I don't know how to find
them, short of find-in-files and searching the Wiki, which is time
consuming (and occasionally fruitless, if I use the wrong keywords to seek
what I want).  

On the other hand, I do know  u:  has something to do with Unicode.  So
usually I just type that, hit Ctrl-F1, and use that documentation to
construct a one-off solution.  So maybe the trouble is that I should spend my
time searching for existing solutions rather than trying to recreate them
from low-level primitives.

-Dan

PS:  Regarding the definition of  u:  .  If I understand it correctly, one
thing I like is the parsimony of  8&u:  .  That verb only produces Unicode
if it's required (there is a character code > 127).  Otherwise it produces
ASCII.  Perhaps we should emulate (or leverage) that behavior in the
definition of  chars  .  

PPS:  When I complained of my recent troubles to a colleague, he pointed me at 
http://www.rgagnon.com/gp/gp-0004.html  which observes:

         Modern OS (eg.W2K, XP) are Unicode based, meaning that two bytes
         are used for each character. 

         However, certain text utilities may not understand Unicode, so if
         you want to create an ASCII version, type the following command: 

         From a DOS shell, type in : 

             C:\> type [unicode file name] > [text file name]

         The type command performs the conversion. The Unicode format is 
         detected from a special two-bytes signature at the beginning of
         a file (0xFF , 0xFE). The conversion includes the mapping of 
         accented characters too.

which will definitely come in handy (at least for me) next time I'm in a
Unicode-related bind.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm