I wrote:
>  The problem is that if your domain is unbounded (i.e. the input strings 
>  are arbitrary and there are no constraints on their content or length), 
>  and you need a 1:1 mapping, then your range is unbounded too.  
>  That is, the strings are as efficient a representation as you're going to 
> get.  

If, on the other hand, you can supply J with some information about your 
strings, you can get a more efficient representation of them. 

For example, if there is no restriction on the length of your strings, but you 
can provide a table of character frequencies (e.g. 'E' appears 12% of the time, 
'T' appears 9% of the time, etc.) with some confidence, J can help you compress 
your strings.  Because you've provided more information, this will be a more 
efficient encoding than the naive method: the compressed string can be 
represented by a smaller number than the uncompressed string.
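To make "a smaller number" concrete, here is a rough sketch in Python rather 
than J: the naive 1:1 mapping just treats the string's bytes as digits of one 
big base-256 number.  (The sentinel byte is my own detail, there to keep 
leading NUL bytes from being lost; none of this is from the J method below.)

```python
def naive_str_to_int(s: str) -> int:
    """Naive 1:1 mapping: read the UTF-8 bytes as one base-256 number.
    A leading 0x01 sentinel preserves any leading NUL bytes."""
    return int.from_bytes(b"\x01" + s.encode("utf-8"), "big")

def naive_int_to_str(n: int) -> str:
    """Inverse of naive_str_to_int."""
    b = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b[1:].decode("utf-8")  # drop the sentinel byte
```

Any lossless compression that shortens the byte string before this step 
therefore yields a smaller integer.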

I haven't quite gotten this to work yet, but here's my approach.  First, read 
Roger's essay on Huffman coding  
http://www.jsoftware.com/jwiki/Essays/Huffman_Coding  .  Then copy the 
"Collected Definitions" section into a new J script and save.  

Load that from another script, which will use the Huffman verbs to compress 
strings:

           lets  =.  ; {."1 P                      NB.  See [1] for definition of P
           freqs =.  0 ". '%'-.~;:^:_1 {:"1 P
           
           lets  =.  lets  , toupper lets
           freqs =.  freqs , freqs+0.1             NB.  hcode doesn't like repeated freqs?
           
           freqs =.  freqs (a. i. lets)} %100*2+i.#a.
           lets  =.  a.
           
           H     =.  freqs hcodes lets
           HA    =.  H;lets
           SB    =.  hst HA
           
           h     =:  HA&hencode :. (SB&hdecode)
           s2i   =:  #.@:x:@:h
           
           s2i 'Alex Rufon'
        4552781575577592135476                     

           ]&.s2i 'Alex Rufon'
        Alex Rufon
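For comparison, the same pipeline can be sketched in Python (not J, and not 
taken from Roger's essay; `huffman_codes`, `s2i`, and `i2s` are ad-hoc names I 
made up): build a prefix code from a weight table, encode the string to bits, 
and read the bit string, with a leading 1 to preserve leading zero bits, as a 
single integer.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code {symbol: bitstring} from {symbol: weight}.
    Assumes at least two distinct symbols."""
    tiebreak = count()  # unique tiebreaker so the heap never compares dicts
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

def s2i(s, codes):
    """Huffman-encode s, then read the bits (plus a leading 1) as an integer."""
    return int("1" + "".join(codes[c] for c in s), 2)

def i2s(n, codes):
    """Inverse of s2i: walk the bits, emitting a symbol at each codeword."""
    decode = {b: s for s, b in codes.items()}
    out, cur = [], ""
    for bit in bin(n)[3:]:  # [3:] drops the '0b' prefix and the sentinel bit
        cur += bit
        if cur in decode:
            out.append(decode[cur])
            cur = ""
    return "".join(out)
```

With a skewed weight table, common characters get short codewords, so the 
resulting integer is smaller than the naive base-256 one.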

As you can see, the encoded number is two orders of magnitude smaller than the 
naive method's, even using this buggy code.  You can increase this improvement 
(and, incidentally, address the bugs) by tailoring a Huffman table to your data 
set.  If you don't need all of  a.  then trim the list to just those characters 
you use.  If you have representative historical data, you can kill two birds 
with one stone: generate both the list of characters you actually use and 
their relative frequencies [1]:

           P =: (~. ;"_1 #/.~) ; ALL_STRINGS_EVER_USED
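In Python terms, that characters-and-frequencies table amounts to something 
like the following (a sketch of the idea, not a translation of the J line; 
`letter_freqs` is a name I made up):

```python
from collections import Counter

def letter_freqs(corpus: str) -> dict:
    """Map each character actually used in the corpus to its relative
    frequency, suitable as a Huffman weight table."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}
```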

But I doubt all this will be more efficient than  s:  (particularly since the 
encoded numbers will likely require extended integers; even more so if you need 
to attach them to other numeric data, which will result in the promotion of the 
entire array to rational or extended precision).

-Dan

[1]  For the sake of this example, I just used the table from Wikipedia:

           P  =:  TAB cut&> }. <;._2 (noun define)
        From http://en.wikipedia.org/wiki/Letter_frequencies as of 2008-Oct-07
        a       8.167%
        b       1.492%
        c       2.782%
        d       4.253%
        e       12.702%
        f       2.228%
        g       2.015%
        h       6.094%
        i       6.966%
        j       0.153%
        k       0.772%
        l       4.025%
        m       2.406%
        n       6.749%
        o       7.507%
        p       1.929%
        q       0.095%
        r       5.987%
        s       6.327%
        t       9.056%
        u       2.758%
        v       0.978%
        w       2.360%
        x       0.150%
        y       1.974%
        z       0.074%
        )
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
