I wrote: > The problem is that if your domain is unbounded (i.e. the input strings > are arbitrary and there are no constraints on their content or length), > and you need a 1:1 mapping, then your range is unbounded too. > That is, the strings are as efficient a representation as you're going to > get.
If, on the other hand, if you can supply J with some information about your strings, you can get a more efficient representation of them. For example, if there is no restriction on the length of your strings, but you can provide a table of character frequencies (e.g. 'E' appears 12% of the time, 'T' appears 9% of the time, etc) with some confidence, J can help you compress your strings. Because you've provided more information, this will be a more efficient encoding than the naive method: the compressed string can be represented by a smaller number than the uncompressed string. I haven't quite gotten this to work yet, but here's my approach. First, read Roger's essay on Huffman coding http://www.jsoftware.com/jwiki/Essays/Huffman_Coding . Then copy the "Collected Definitions" section into a new J script and save. Load that from another script, which will use the Huffman verbs to compress strings: lets =. ; {."1 P NB. See [1] for definition of P freqs =. 0 ". '%'-.~;:^:_1 {:"1 P lets =. lets , toupper lets freqs =. freqs , freqs+0.1 NB. hcode doesn't like repeated freqs? freqs =. freqs (a. i. lets)} %100*2+i.#a. lets =. a. H =. freqs hcodes lets HA =. H;lets SB =. hst HA h =: HA&hencode :. (SB&hdecode) s2i =: #.@:x:@:h s2i 'Alex Rufon' 4552781575577592135476 ]&.s2i 'Alex Rufon' Alex Rufon As you can see, the encoded number is two orders of magnitude smaller than naive method's, even using this buggy code. You can increase this improvement (and, incidentally, address the bugs) by tailoring a Huffman table to your data set. If you don't need all of a. then trim the list to just those characters you use. If you have representative historical data, you can kill two birds with one stone; you can generate the list of characters you actually use and their relative frequencies [1]: P =: (~. ;"_1 #/.~) ; ALL_STRINGS_EVER_USED But I doubt all this will be more efficient than s: (particularly since the encoded strings will likely require extended integers; even more so if you need to attach them to other numeric data, which will result in the promotion of the entire array to rational or extended precision). -Dan [1] For the sake of this example, I just used the table from Wikipedia: P =: TAB cut&> }. <;._2 (noun define) From http://en.wikipedia.org/wiki/Letter_frequencies as of 2008-Oct-07 a 8.167% b 1.492% c 2.782% d 4.253% e 12.702% f 2.228% g 2.015% h 6.094% i 6.966% j 0.153% k 0.772% l 4.025% m 2.406% n 6.749% o 7.507% p 1.929% q 0.095% r 5.987% s 6.327% t 9.056% u 2.758% v 0.978% w 2.360% x 0.150% y 1.974% z 0.074% ) ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm
